AI Engineer Singapore · Day 1 · 16 May 2026

The Year of the Harness, Decoded

All 33 talks from AIE SG Day 1 — summaries, slides, transcripts, and the operator-mode read on why each one matters. Use this as your aide-mémoire long after the conference.

33 talks 10 hours of content 731 slides captured 70,000+ words of transcripts
Day overview

What happened on Day 1

AI Engineer Singapore Day 1 (16 May 2026, Capitol Theatre) was the first AIE landing in Asia and immediately functioned as a coronation event for the harness layer. Across 33 sessions, the strongest signal was that the model-vs-harness debate is over: every serious operator on stage — from OpenAI's Codex lead to Stripe's coding-agent team to GovTech Singapore — spent their slot describing what sits above the model (sandboxes, judges, skills, evals, playbooks, identity, observability) and treating the model itself as commodity infill. The conference's center of gravity was the agent-platform stack and its enterprise deployment; the closer — swyx's "Agent Lab Nation" — explicitly named the year-of-deployment / decade-of-agents arc and announced Cognition's Asia HQ in Singapore via the Havana acquisition. The day's energy was less Bay-Area-evangelism and more APAC-operator-pragmatism: how do you ship agentic code into a 2M-line COBOL bank, route inference across 15 regions for data residency, or run a Foreign Minister's diplomatic prep on an 8GB Raspberry Pi.

Key themes

What every builder should leave with

Harness-as-moat moved from thesis to consensus

Almost every speaker named the orchestration layer (Skills, MCP, sandboxes, evals, playbooks) as the durable surface — explicit in Vivian, Feng GovTech, Cohen NanoClaw, Sottiaux Codex, Jukic Daytona, Doyle Stripe Minions, Cursor 3, Sakana, and swyx's close. When the policy minister, OpenAI, Stripe, Cursor, and a Tokyo sovereign-AI lab all describe the same architecture, the bet is no longer contrarian.

Singapore is the APAC harness-deployment hub

Cognition acquired Havana and chose SG as Asia HQ; GovTech is building a sovereign agentic harness with versioned Skills for 150k officers; Featherless raised USD 120M; Sea Limited will run a Codex hackathon on 6 June; Menlo's open-source humanoids hit USD 1M pre-orders out of SG. swyx explicitly told the room to 'give up on the government' and run private-sector-led.

Open models crossed the Sonnet/mini line and approach Opus

GLM 5.1, Qwen 27B and Gemma 3 1B, and DeepSeek-V3-class MoE models collectively reset the assumption that frontier capability requires API access — laptop-runnable models now beat GPT-4 on coding. Z.ai is a diamond sponsor with a credible Opus-class open model; Featherless serves 30k of them as a router.

Security and sandboxing went from afterthought to architectural primitive

Cohen, Jukic, Zhang Sonar, Simular, and Sottiaux's Auto Review all converged on the same pattern: separate tool call from tool execution, never give agents real credentials, treat the codebase itself as attack surface, run an adversarial judge against worker output. Daytona cited 800+ malicious skills found in early-2026 audits — agent-economy supply chain risk is now real.

The SDLC bottleneck moved from build to review

Sottiaux's frame — agentic coding widened the pipe so planning, review, validation, CI, deploy, debug are the new constraint — was echoed by Lai (agent-unfriendly errors are dead weight), Greptile (27.6% of April PRs are agent-authored with model-specific bug fingerprints), Sonar (AI catches only 41.5% of human-flagged issues), and Stripe Minions (3000 PRs/week at 65% one-shot merge).

Multiplayer / canvas / voice surfaces are the next harness frontier

Aosheng Ran's Figma multiplayer canvas, ElevenLabs' turn-taking voice engine, Cursor 3's 'glass not black box' UX, and Bland's voice failure-mode taxonomy all argue that chat is the email of agentic software. The shape of the next defensible product is collaborative, embodied, multimodal.

Long-horizon agents = depth, not duration

Zixuan Z.ai's three failure modes — goal drift, error accumulation, no pivot — recur in Prime Intellect's RLM framing and Stripe's clean-context LLM judge. The pattern: agents that succeed over hundreds of rounds do so via forced goal rereads, file-handle discipline, and judges that can't be talked into 'done.'

Sovereignty redefined as deliberate layer-ownership

Sakana's Druga gave the cleanest articulation: no country owns the full stack; sovereign AI = picking which of data/eval/adaptation/routing/governance/physical you control. GovTech and Cerebras operationalize this — post-training and routing as the practical sovereign control points, not pre-training.

Where the market is going

The operator-mode read

Enterprise buyers showed up paying for two things: agent platforms with deterministic guardrails (Cognition Devon, Stripe-style Minions, Cursor 3, Codex Auto Review) and sovereign-grade inference/post-training stacks (Groq's 15-region data-residency play, Cerebras' MoE wafer pitch, Sakana's MUFG/SMBC credit memos, Featherless' router). The pure-model layer is commoditizing fast — Z.ai's GLM 5.1 ships into the same harnesses as Anthropic and OpenAI, and Featherless' thesis is "30,000 models, no preferred one." Capital is flowing to the layer that captures the workflow: $120M to Featherless, an undisclosed Cognition spend on the Havana / Asia HQ play, $100M+ AI budgets at named APAC banks. Categories accruing value: orchestration runtime (Daytona, NanoClaw, Cognition Devon), eval infra (Greptile, Sonar CRAP, Frontier Suite), domain-bound post-training (Sakana, Prime Intellect), and the legible "glass not black box" UX layer that makes agents enterprise-shippable (Cursor 3, Stripe Slack-first deploy).

Anchor talks

Top 5 talks to watch first

  1. swyx — The Agent Lab Nation Anchor closer. Cognition picks Singapore as Asia HQ. The strategic frame for the year ahead.
  2. Dr Feng Yuzhang — AI Native Government Singapore explicitly building the sovereign agentic harness (MCP gateway + Skills platform + identity + memory) for 150k officers.
  3. Dr Vivian Balakrishnan — Second Brain Workflows Foreign Minister running NanoClaw + Neman + Obsidian on an 8GB Raspberry Pi. 'Tools matter more than models.'
  4. Stefania Druga (Sakana) — Sovereign AI Six-layer sovereign stack (data / eval / adaptation / routing / governance / physical). MUFG/SMBC credit-memo case study.
  5. Mark Doyle (Stripe) — Minions Payments-grade coding-agent infra. LLM-judge-with-clean-context primitive. 65% one-shot merge rate at 3000 PRs/week.
Speaker directory

All 33 speakers · click to jump

#00
Sherry
65 Labs
AI Engineer Singapore Day 1
#01
Dr Vivian Balakrishnan
Singapore Minister for Foreign Affairs
AI Adoption, Personal Understanding, and Accountability
#02
Gavriel Cohen
creator of NanoClaw / NanoCo
Guardrails, Container Isolation, and Safe Autonomous PR Agents
#03
Thibault Sottiaux
OpenAI
Codex across the software lifecycle, agent reviews, and approval fatigue
#04
Dr Feng Yuzhang
Head of AI
Singapore's Sovereign Agentic Harness
#05
Phil Hedayatnia
co-founder
Design intent, human taste, and keeping AI products from flattening into sameness
#06
Annie Luo
UX Researcher
Designing AI for Subjective Decisions in Shopping & Travel
#07
Jimmy Lai
Vercel
Docs, Errors, and the Cost of Ownership
#08
Vedran Jukic
CTO/co-founder Daytona
Why Autonomous Agents Need Sandboxes, Isolation, and Strict Boundaries
#09
Vaishant Kameswaran
Co-founder/CTO
What 5 Million Vibe-Coded PRs Reveal About Agent Bug Profiles
#10
Yuntong Zhang
Sonar; final-year PhD
Remediation Loops and Executable Evals
#11
Eugene Cheah
Founder/CEO Featherless AI
Open Models Are Here — Why Singapore Should Just Build
#12
Max Buckley
Head of Knowledge Research
What Comes Next When Software Scarcity and Coding Costs Collapse
#13
Mark Doyle
Stripe
One-Shot Coding Agents at Stripe
#14
Li Hau Tan
Member of Technical Staff
Reliability, Trust, and Scalability in Computer-Use Agents
#15
Ryo Lu
Cursor
Designing Cursor 3, Baby Cursor, and software that feels like play
#16
Aosheng Ran
Product Designer at Figma
Multimodal, Multiplayer AI Canvases for Divergent Co-Creation
#17
Selim Arguel
Menlo Research
Sim-to-Real Pipelines, Vibe-Coded Robotics, and Open-Source Humanoids
#18
Alberto Taiuti
CEO & Co-founder
World Models, Real-Time Video, and Generative Software Primitives
#19
Jan Liphardt
OpenMind
Embodied AI and Social Intelligence
#20
Andrew Tan
Platform Engineering Lead
Low-Latency Inference, Custom Silicon, and Global Routing
#21
Daria Soboleva
Head Research Scientist
Scaling MoE Training Past GPU Communication Bottlenecks
#22
Zixuan Li
Z.ai
GLM 5.1, Open Models, and Long-Horizon Task Reliability
#23
Boris Starkov
ElevenLabs
Speech engines, turn taking, and conversational voice agents
#24
Jackman Ong
Founding Research Engineer
Recursive Language Models, Memory, and Long-Running Agents
#25
Michelle Julia
Co-founder
Architectures for Long-Term Relationship-Aware AI Agents
#26
Jacky Mok
Head of Applied AI at Reka
From Language Models to Physical Intelligence and World Models
#27
Gokul Srinivasan
Co-founder
Simulation, Games, and Faster Robotics Training Loops
#28
Wei Wei Hsu
Wayi
Eastern Product Building, Cultural Context, and Local Storytelling
#29
Anun Joshi
Bland
Lessons from Scaling Bland to Millions of Calls
#30
Linh Nguyen
Obello
Editable Brand Systems and AI Creative Ops
#31
Stefania Druga
Research Scientist at Sakana AI
Local Model Ecosystems and Japan-Specific Deployment
#32
Sean 'swyx' Wang
Founder
Harnesses, Evals, and Enterprise Deployment
Full coverage

Every talk, decoded

Each section: summary, why it matters, slide gallery, slide notes, and full transcript. Click any slide to enlarge or jump to that exact moment in the YouTube video.

Opening Remarks: AI Engineer Singapore Day 1

Sherry, 65 Labs (grassroots builder collective, Singapore)

Summary

  • Sherry of 65 Labs opens AIE Singapore Day 1 at Capitol Theater, framing the conference as a grassroots builder collective's 'love letter to the ecosystem' rather than a corporate event.
  • 65 Labs is volunteer-run by full-timers; frontier AI teams moved from giving hackathon credits to showing up in person — Swix (AIE CEO, Singapore-origin, met in NYC) brought the global AIE format to SG for the first time.
  • Day 1 ran 20 workshops across 5 concurrent rooms plus a leadership track; 98% of attendees showed up for laptop-open workshop day; community sponsored 20 student scholars.
  • Programming: 60+ talks across three themes over two days; venues are Capitol Theater (talks), Pullman and Atelier at Kinsky (two curated expo areas), plus 'the cave' — a sound-reactive vibe-coded decompression room.
  • 65 Labs vibe-coded the conference map and published a public API on the program so attendees can build their own scheduling tools.
  • Sponsors: OpenAI and ZAI (diamond); Google DeepMind, Arise, Cursor (platinum).
  • Teed up the keynote from Singapore Foreign Minister Dr Vivian Balakrishnan, citing his viral 'second brain' tweet as proof the policy class is hands-on with the tools.

Why this matters

When OpenAI, ZAI, Google DeepMind, and Cursor all show up in person to a grassroots builder collective's conference, and the host country sends a sitting Foreign Minister to keynote, Singapore stops being a satellite stop and becomes a co-equal node on the global AI-engineering map. The signal for builders: policy layers and frontier labs are converging on the same builder communities, and the deployment-at-scale story is increasingly happening outside the SF Bay Area.

Slides (2)

Slide notes

[t=36:44] Speaker shot — no slide content. Static event title card visible on screen: "AI Engineer SINGAPORE" logo with line-art Singapore skyline (Marina Bay Sands, Singapore Flyer, Gardens by the Bay supertrees). Lower-third chyron: "Sherry Jiang / AIE Team / Kick-off". "Presented by 65 labs" badge in corner. Speaker at Capitol Theatre podium.

[t=39:23] Speaker shot — no slide content. Same static event title card and chyron as above; Sherry speaking into handheld mic at podium.

Full transcript (1363 words)

Joining at 8:40 a.m. On a Saturday um for day two of AI Engineer Singapore. Uh just as a way of introduction, I'm Sherry and I'm one of the members of 65 Labs. Uh we're one of the largest grassroots builder collective um here in Singapore. Uh an article actually recently came out about us this week. Um it's just a few of us who are actually doing all of this in our spare times. Uh we all have full-time jobs, but you know, this is something that we are all very very passionate about to bring to Singapore. So this conference really is our love letter to the ecosystem. Now, somewhere along the way of all these hackathons and bill nights we are running, um something magical really happened.

Um some of these frontier AI teams that are here today building the models that you use every day, uh started to show up for our community. Um they're giving credits for our hackathons. Uh even showing up late on Zoom to do workshops for people and you know we've really had all sorts of people um being supported from as young as 13 years old to folks who are in their 60s just learning all of this as well. Um it really is just a great time to kind of come together and be a builder. But what we got was more than just a few credits. Uh we started actually to build a relationship uh with some of these teams and uh that is the magic that we see in this room today.

Now, you might think that this is the first time all this is happening in Singapore, but it's really been happening under the surface for a long time before we gap gathered at the Capitol Theater here today. Um, so that's why it was no surprise when some of our speakers actually told us their entire plane from San Francisco to Singapore was actually just full of folks uh coming down for AIE. So, how did this conference actually happen? Um so uh we actually met Swix uh who is the CEO and uh co-founder of uh AIE Globally. Um met him actually in New York City and uh I don't know if you guys know but he's actually from Singapore originally.

So it just clicked and made sense. Um we've been working with a lot of these teams uh remotely and we just wanted to bring them in person in Singapore all together for the first time. So to hear a little bit more about the AIE story, uh, Swix will be, uh, speaking about cognition, but also be closing out day one to hear, uh, to share a little bit more about the AIE All right. Um, now, show of hands. I'm kind of curious who was all at the workshops yesterday. Woo. All right. That is pretty much like 98% of the crowd. Um that's awesome to see because that was a laptop open day and that's what we wanted to do differently about this conference where we're not just yapping about stuff but we're you know building and uh apping in a way.

So we wanted to make sure that you know all of this is designed for for practical knowledge right and uh yesterday uh just so you know we actually had 20 workshops running five rooms concurrently and an entire leadership track as well. So, um, we really wanted to put programming at the absolute heart of everything here. Um, so we're all learning and building alongside each other. And because we want to create this place for learning, um, we also wanted to give the opportunity for the next generation to get this experience. So, uh, we were actually able to come together as a community to support 20 students who are actually scholars here today at the conference.

So, um, would you stand up and wave? So, every one of these tickets were actually partially or fully sponsored uh by builders in our community um who believe that this is what Singapore's AI future looks like. So, what can we expect over the next two days? Uh we didn't just you know string together a bunch of talks to you know keep you guys here for lunch and things like that but we wanted to bring you the kind of conversations that you can't just Google or find in Corsera but actually be in the thick of these conversations that are happening with the people who are actually building with these tools whether it's on Twitter or research papers and all of that so we want to bring these conversations so you can be in the middle of them as well and we know that there's a lot going on uh there's over 60 talks actually within the next two days across across three themes and you may need some help kind of navigating and figuring things out.

So, uh we actually have a guide that we sent out an email that has a map that we actually vibe coded as well as a complete program list as well. Uh not only that, we created an API that you can actually use that's publicly available. So, you can actually build your own tool on top of the program because we didn't just want to create like a one-sizefits-all app that you download and you use. We wanted to create something that you can actually build for yourself because that's kind of the spirit of what we do. So, AIE today and tomorrow is going to be spread across different spaces. Uh, this theater right here in Capitol Theater is going to be where all the talks are going to be happening.

And once you kind of get that inspiration for like, you know what, I really want to go talk to this team. That was super interesting. We have two expo areas set up. Uh one is going to be in Pullman across the street and the other is the Attelier in Kinsky. And these are again not just booths that people are just setting up um just because but we actually made this a curated space where you can actually have face-to-face conversations with the very people who are building the tools that you are And then finally, we also wanted to make sure that we give you space to uh you know decompress and touch grass as well because it's a long two days.

So, uh, we're happy to share that we're going to have a 15-minute break session where there's going to be an experiential space, uh, called the cave, which is a re immersive sound reactive decompression room that was actually entirely, uh, vibecoded by the creator. You'll also find a lot of folks running around with red shirts. Um, these are our amazing volunteers who will be helping you navigate the spaces and make sure that you get the most out of every conversation happening here today. And we'll not be in this room without our incredible sponsors. So, um, our diamond sponsors are OpenAI and ZAI.

And our platinum sponsors are Google Deep Mind, Arise, and Cursor. And thank you for the uh thank you to the Capitol Theater for providing us with this beautiful space. Now the Singapore story has always started with the builders. Uh which is why a few weeks ago uh it really blew our minds when um our very own minister for foreign affairs Dr. Vivian Bala Krishnan uh went viral for a post on Twitter for building his own second brain. Uh and it makes sense because he has a role that demands navigating large volumes of information and rapid context switching. So his reflections on building this kind of workflow and tool for him really underscores that meaningful conversation about AI should involve understanding the tools themselves and not just thinking about you know the the abstracts of it.

Um, and with that, uh, I am absolutely honored to introduce our keynote speaker and a builder himself, Singapore's Minister for Foreign Affairs, Dr. Vivian >> Use this.

Second Brain Workflows: AI Adoption, Personal Understanding, and Accountability

Dr Vivian Balakrishnan, Singapore Minister for Foreign Affairs (retired eye surgeon)

Summary

  • Vivian Balakrishnan (SG Foreign Minister, retired eye surgeon) built a personal agent over 3 months on a 2-3 year old Raspberry Pi with 8GB RAM — proof the agentic-tooling barrier has collapsed.
  • Three anchor messages: (1) you can outsource computation, memory, and dissemination but not personal understanding or accountability; (2) value lives at the workflow/individual layer (cites Neil Lawrence FT letter), not in frontier model capex; (3) tools matter more than models.
  • Stack: NanoClaw (chosen over OpenClaude for security, containerization, short readable codebase, no config files), Baileys for WhatsApp bridge (flags likely Meta ToS issue), Whisper voice in/out, Neman graph memory (entity/causal/temporal/semantic edges), Ollama local embeddings, Karpathy-style LLM-supervised wiki, Obsidian over Apple iCloud as personal cloud UX.
  • Uses it for diplomatic prep (12 countries this month, hundreds of meetings), first-draft speeches, parliamentary Q&A, and the slides for this talk itself.
  • Threat model: only loads already-public foreign policy material, so a breach exfiltrates nothing beyond his phone number.
  • Caveats: token pricing is currently a subsidy; LLMs aren't always right — deterministic systems, expert/rules systems, and eventually neuro-symbolic architectures (nods to LeCun) matter.
  • Closes with DPM Gan policy line: SG won't lead on model development, will lead on deployment at scale — ground-up decentralized push.
  • Direct ask to NanoClaw creator Gabriel: make all models first-class citizens by June 15.

Why this matters

A sitting Cabinet minister running a personal agent stack (NanoClaw + Neman graph memory + Ollama embeddings + Obsidian + Baileys + Whisper) on a three-year-old Raspberry Pi is the strongest possible proof that the bottleneck has moved from frontier models to assembling the tooling around them. Vivian Balakrishnan's three anchors you can outsource computation but not understanding or accountability, value lives at the workflow layer (citing Neil Lawrence), and barriers to agentic tooling have collapsed is a government-level endorsement of the 'tools matter more than models' thesis, and his deployment-at-scale framing locks in Singapore's official AI position.

Slides (14)

Slide notes

[t=47:20] Slide: "A second brain for a diplomat" (subtitle: Assembled, not written – Nanoclaw running on a Pi 5)

  • Three-stage pipeline: 01 INGEST (Raw sources: transcripts, articles, web clips, voice notes) → 02 EXTRACT (mnemon graph: synthesised facts, SQLite + local embeddings) → 03 SYNTHESISE (Wiki pages: narrative knowledge, UX in Obsidian)
  • Assembled from open source: Claude Agent SDK, Baileys (WhatsApp), mnemon (graph), Ollama + nomic-embed, whisper.cpp, OneCLI
  • Tagline: "Tool assembly by me. Everything else, written by someone else"

[t=58:41] Slide: "Learn by doing."

  • THE CLAIM: "You cannot govern a technology you have only been briefed on. Reading the executive summary tells you what the technology does. Building with it tells you where it breaks, what it costs, and what it cannot yet do."
  • WHAT I FOUND (three bullets):
  • Context windows are the budget — every token costs money and attention; you design around them or you don't
  • Tools matter more than models — the model is a commodity; what you wire to it is the product
  • Memory is the unsolved part — stateless chat is a dead end for real work

[t=53:00] Same "A second brain for a diplomat" slide held; speaker gesturing mid-stage.

[t=65:46] Stage wide shot — AI Engineer Singapore title card with city skyline and AIE letters on stage; no slide content.

Full transcript (2722 words)

>> Hi, good morning everyone. You know, we can be a bit more informal in Singapore. So, good morning. I know it's raining, but Singapore's usually sunny. Um, I feel like an impostor here. Uh, for those of you who don't know me, I'm actually a retired eye surgeon. Took a detour into politics for perhaps too long. Um, but I've always retained an interest in getting things done, building things, fixing things. And since I don't get to operate on eyes anymore, uh I assemble watches, I reprogram appliances, and now there's some other stuff which is this which is what I'm going to talk about today.

But actually, I wanted u to explain why I did it and the implications of this. And I think with this audience, you'll get it straight away. But let me jump to the end. Um and to say these are the three key messages which you can forget everything I've said but just bear these things in mind. We're now at an age when you can outsource a lot of stuff calculations computation memory dissemination of knowledge. The one thing which you cannot outsource is your personal understanding. And if you are in a position of authority, you can delegate work. You can't delegate accountability. So remember the personal element in that understanding and accountability.

The next point and I would refer you to a nice short letter published in the Financial Times by Professor Neil Lawrence, University of Cambridge. She's the professor of of machine learning. And you know there's a lot of hype about data centers, top-down systems, rules, governments. That's macro. But his hypothesis is that real value for the economy and society is created at the ground level workflow by workflow sector by sector department by department and in fact at the individual level. What this means is that it's look, I know you guys are great and I know the guys working on frontier models are incredible, but the real payoff is when ordinary people, teachers, lawyers, technicians, managers, doctors, lawyers or even ministers are actually using the tools which are already available, already invented.

People who know their jobs and are empowered by these tools. That's how you create real value for society and for the economy. So I'm looking at decentralization, individualization, bespoke models. I'm talking about making yourself better at what your day job is and even better still re-engineering the workflows of your life. That's where the real value boost is. And the third and that's why I'm making this presentation is that I sincerely believe the barriers for achieving all this have collapsed. The tools have already been It's a matter of getting people to understand what tools are out there, assemble their own tools and put ourselves on a completely different trajectory.

Okay. So now let's do the fun part as to what my adventures began. Now my personal agent first came to life almost exactly three months ago. Uh yes, I got caught up by the open claw uh hype but immediately given my job I knew that was not practical because security was an issue. And then some someone else then pointed to nanoclaw and I think we are going to hear from Gabriel after this where you know and as a geek and as a tinkerer myself I like stuff which I can grasp. So the fact that nanoclaw has a very short code base which even an idiot like me can read and sort of understand the fact that it's containerized and as a surgeon I know that there's no such thing as a routine operation and things will go wrong things will break and when they do break hopefully you want them to break within barriers.

So the containerization part, the understandability part was vital for me. Anyway, simple go to GitHub, download the stuff. And the other attractive part about it is there no configs. There's in fact because you rely on LLM to do all the bespoke tailor customizations. In fact, you realize everyone running an instance of nanoclaw is running an individualized system. Now, that's both good and also has its share of complications. But anyway, so let me tell you what I did with it, right? So, Nano Claw provides the platform. It allows me to communicate through WhatsApp with my agent. That part's not rocket science.

The thing which if I could go back one slide the thing which I was really after was how could I use it for my daily life. Let me give you an idea of my daily life. This month I'm visiting 12 I have I will therefore have to meet hundreds of people. I will have to understand the country's economy, culture, history. War and peace. I need to know people as individuals and not just something I from a brief and there's a huge huge cognitive overload on every single diplomat. And the question is how can I turbocharge this process so that if I need a fact or a I can get it I can get it anywhere and I can go down the rabbit hole if need be.

So it's got to do with this whole are useful for analysis, for abstraction, for expression and certainly for drafting briefs, drafting speeches, formulating answers to questions in including, I must add, parliamentary questions. And three months ago which includes the whole debates in in parliament. Uh it was extremely impressive to see both the questions and the answers which generated and uh with due respect to all my colleagues in parliament uh some of the AI generated debates uh far more incisive shall I But anyway so it communicates with me through WhatsApp. So there's this bit of software called Bailey's.

I suspect it's probably uh not entirely in keeping with what uh Meta or WhatsApp would like us to do because it's actually simulating, you know, the way we get WhatsApp to work in our browsers or in on our laptops. So it's it's a pseudo terminal in a sense. The bit which I believe is the real frontier for people like me is memory and fortunately I came across this obscure piece of software called Neman. I still haven't met the developers so I don't really know but a memory system with graphs. So it's got The edges are entities, causality, temporal relationships, and semantic. And also because I didn't want to be confined to just keyword searches.

The fact that I could run Olama locally with an embedding model means I also have semantic search built in. So with these elements, I mean, whisper is the part that's easy because with WhatsApp, I didn't want to only have to type. I wanted to be able to speak and he can speak back to me. And of course, my dream uh is one day to just have my agent answer supplementary questions in Parliament. I'm not sure about the legality of that, but if it happens, you you'll know that I shared the idea with you first. But the point is I was now able to speeches, transcripts, particularly of my own contributions, get it into the system, digested, put into that memory database.

And then around the same time, Andre Kapati came up with his LLM supervised wiki generation. So I added that in as well. And then for the UX, the user interface, I used Obsidian partly also because Obsidian allows me to use uh the Apple iCloud and that therefore immediately means I've got a personal cloud and all the wikis which are extracted from this personally curated database becomes available to me because remember I started off by saying the key is personal understanding. So I've got a memory system, I've got a communication system, I've got an analysis system, but all nice in theory. But what I here to share with you is that in last three months, I found it incredibly useful meeting people, traveling, first drafts, first cut of a speech.

Even today's presentation uh even the slides actually were generated by claude you know it's the pace at which things can be done and as a practitioner so not as an engineer but as a practitioner with a day job it's useful and I can attest to it usefulness because I can honestly tell you I have not dared to switch it off and Nano Claw unfortunately well has moved from version one to version two when version two came on because their transition is not at all smooth I've left version one and I put version two on another computer and and I should also add all this stuff one of that my my most daily used agent is running off a Raspberry Pi which is at least two or three years All it has is 8 GB of RAM.

You you see my point about relevance, use. Let's let's go on to the next slide. And this is my point. The barriers have fallen because I did this. I did this without writing Claude, or the credentiing system. You know, there's this whole thing about vibe coding. I won't even dare to claim I was vibe coding. I was just assembling you it's just tool assembly and so I I I should actually change that line I didn't write any glue I can honestly say yes I have gone through the you know the nanoclaw insists that you approve every time you give bash access to the agent so I do scan through it I and it does help here it does help if you don't understand coding so you understand what's going on even if you're not actually typing and editing code in the raw.

Next, in a sense, my approach to all this has been to learn by doing. It's not enough to sit down and read, get the headlines, get the summaries done. If you're interested in anything, get your hands Learn and you learn best by doing. And because the barriers for entry have been have come down so dramatically, everyone should em embark on their personal on their personal exper And you know Claude came up with this quote which I got a bit suspicious about. You know who has said it before. Says it. It claims no one else has. But actually, I kind of agreed with it and this is a shout out to my government colleagues.

You cannot govern a technology that you have only been briefed on. You better get your hands dirty and then you understand both the potential and the limits and the a few other digressions down here. Um there are constraints. So for instance depending on LLMs and quite frankly I mean I'm the prices for which um AI majors are currently charging us. I think we all know we're enjoying in effect a subsidy. Tokens are not cheap. Compute power is limited. Electricity prices have risen. Wars do not help. And we should beware of just trying to throw every problem and every step in a solution at an LLM.

It reminds me of the old proverb, you know, for a man with a hammer, everything looks like a nail. And in fact there are good both economic and design advantages so that you use LLMs but do not forget there is still a role for deterministic systems. There is still a role for expert rulebased systems and my my personal belief as a biologist in the end some kind of neuro symbolic system rather than just uh the LLM model and I have some sympathy for Yan Leon who says you know I think that LLMs are great but actually that's not the way we've solved it in nature. If you look at the human brain, actually I suspect we have less of computation in the human brain than in many of the large language models which we have today.

And I can tell you as an eye surgeon, the cortical computation for vision, for language, for cognition are often based on far more efficient structures than the energy gobbling systems which we have today. The point I'm making and where I'm agreeing with Yan Lun, you know, is that in the end these are pattern recognition systems with with memory. And out of what looks like simple fundamental abilities is emergent behavior which gives you conceptual understanding which gives you language which gives you the ability to do things. So my point is this is a field which is still exploding and therefore approach this with humility.

Approach this by just doing your best, improving the productivity of your daily job, but understand that actually we are perhaps one of the most privileged generations to be living through a Tools matter more than models. And I think um Gab will know I've told him by June I think it's June the 15th I need nanoclaw to make all models first class citizens. Uh there are reasons for that which we can discuss later. And then finally It is a very human and I think it is the great unsolved part of this frontier. Next slide which I think on security I'm not going to belabor this. Uh just as an aside even if you hack my system uh the most thing you'll get from it is my phone number.

Uh you will get summaries of foreign policy but since it's foreign policy which I have espoused and in any case I have curated the stuff I've put in uh even if you take my system I think it will generate the foreign policy of Singapore anyway now that's one way of addressing security by making sure you only put what is already open source what is already published and you subject your systems to a level of transparency and scrutiny that can be withstood. But do not forget security remains paramount and in fact the complication to the dissemination of AI is going to be commercial national security, cyber security and the superpower contestation.

These are the political factors that are going to affect the the speed and the dissemination of AI of the future. This again is a separate political talk well worth a deep dive. And next slide and I hope that is my last slide. So the goals I'm a believer in deployment at the I'm a surgeon. I believe in doing. I believe in fixing. I believe that's where lives are safe. Value is created. Second, therefore the public policy goal is the democratization of these tools. And that's why you will see in the economic strategy review committee DPM gun said we are Singapore is not likely to be at the frontier of model development.

But we can be at the frontier of deployment at scale. So and therefore if that's what we believe then it must be a decentralized groundup approach and that's why I'm here today because I found out this or conference was organized less than three months ago. 65 labs. All the people you meet here, this is all not even their day job. It's a hack, right? But this is the way this is the way I believe the future is going to be created. So, thank you all for being here. Thank you for part of this journey. Have a wonderful day, a wonderful future. Thank you very much. You should have given this. >> I I should have worn this before.

You should have given it to me before. I would have won. >> I wasn't We weren't brief. But thank you so much. So much. >> Thank you. You got to make an announcement, right? I let her know that. All right, everybody. And um I'm super excited to be introducing our next speaker, none other than the creator of Nano Claw himself of Nanoco, Gabrielle Hi everybody. Really excited to be here. Just getting things set up. Just need your sites to load then it >> should try to go to hospital. >> Can you put your mics and just can you m Wait, it's loading now. It's gone. It's getting there.

There we go. This one just got right here. Hi everybody.

NanoClaw: Guardrails, Container Isolation, and Safe Autonomous PR Agents

Gavriel Cohen (creator of NanoClaw / NanoCo)

Summary

  • Gavriel Cohen introduces NanoClaw, an open-source framework for secure autonomous Claude assistants — 30K stars, 12K forks, 2.5K PRs/issues in three months.
  • Core demo: a Slack-hosted 'agent factory' that triages, reviews, plans tests, spins up VMs, and merges every incoming PR — operating on unsanitized adversarial input (PRs from anyone, including agentic spammers).
  • Doctrine: instructions are for steering, never security; treat agents as operating in 'enemy territory' that can be turned into a double agent at any moment.
  • Three architectural primitives: (1) container isolation so the agent can't touch router, Slack bridge, or host VM; (2) credentials kept outside the agent — every outbound request proxied through a vault (partnered with 'one CLI') that swaps a literal 'placeholder' token for the real secret only if policy permits; (3) separation of tool call from tool execution — agent emits an intent (e.g., a gh CLI command), host process surfaces it as a human approval prompt in Slack, privileged action runs outside the red zone using the approver's credentials.
  • Side patterns: per-agent containers (multi-tenant by design), test orchestrator that SSHes into a fresh VM and prods the agent through real Telegram messages, persistent Slack threads per agent.
  • 'Talk to your supervisor' feedback loop: a supervisor agent proposes instruction/skill edits — the factory improves itself.
  • Closes by handing the audience live Telegram access to his personal agent (calendar, email, drive) via QR code, demonstrating trust in the isolation+vault+approval stack.

Why this matters

NanoClaw hitting 30K stars and 12K forks in three months while shipping a Slack-hosted multi-agent factory that auto-merges PRs from adversarial input is a working blueprint for the orchestration layer above the model. The doctrine instructions for steering only, never security; vault-injected placeholder tokens; separation of tool call from tool execution with human approval; container isolation as table stakes is the canonical design pattern for any agent that touches money, identity, or production systems, and the supervisor-suggests-edits loop is a glimpse of how harnesses start to improve themselves.

Slides (12)

Slide notes

  • [t=78:19] Slide: "Credential handling / With Vault" — hand-drawn architecture diagram. Outer box labeled "NanoClaw" contains: Slack App \<-\> Slack Bridge \<-\> Router \<-\> Sandbox (containing red-outlined "Agent"). Right side: Agent \<-\> Agent Vault (OneCLI), labeled "No Credentials". Agent Vault then connects down to red-outlined "External Service" box, labeled "With Credentials". Pattern: agent never holds credentials directly; vault brokers calls to external services.
  • [t=82:51] Screen-share of a Slack/PR-review thread (pr-factory channel). Visible PR titles in left rail: "...mmunity): add agent-browser-based LinkedIn" (-0 votes, red triangle) and "...s): add LinkedIn Ads playbook skills for Ads" (-0). Right pane shows reviewer comments enumerated:
  • Point 2. "Carve out at the call sites — in setup/register.ts and scripts/init-first-agent.ts, skip calling namespacedPlatformId when channel === 'cli'. Keeps the function pure but scatters the exception across every setup path. Any future setup script that calls the function for CLI would silently re-introduce the bug."
  • Point 3. "Rename 'local' to something shaped — give the CLI a platform ID that already trips one of the existing guards (e.g. make it start with +, contain @, or start with group:). Absurd in practice — it would be a hack to exploit an unrelated heuristic."
  • Verdict: "The PR's approach — centralize the exception in namespacedPlatformId, parallel to the existing deltachat carve-out — is clearly the right call. It's the single point of truth, consistent with the existing pattern, and future CLI setup paths get the right behavior automatically."
  • Draft reply being typed: "Makes sense. Let's do more comprehensive testi[ng] because this touches a cri[tical]..."
  • UI: "Also send to pr-factory" checkbox.
  • [t=73:48] Title card: "This is prompt injection ground zero." — red/black explosion graphic. Speaker shot.

Full transcript (2490 words)

I'm uh Gabriel Cohen and I created NanoClaw. I have uh in my telegram right now a AI assistant that's connected to my emails, my calendar, uh connected to my call notes. Uh it has access to sensitive information. It can take sensitive action like reading my emails, sending out an invite. Uh at the end of this presentation, 15 minutes, I will give everybody here access to talk to it freely. Um, and I can do that and I'm not crazy and that's not dangerous. And throughout this talk, I want to explain to you a few concepts about Nano Claw that make that safe. Um, and uh, to demonstrate those concepts, I'm going to talk about our um, agent factory that we built and along the way I'll share some things that I think are interesting about choices we made while building it.

So first of all, NanoClaw is an open-source framework for building uh secure autonomous assistance or claw assistance. Um in just three months, we have over 30,000 stars on GitHub and uh many thousands of users all over the world including uh Dr. Vivian Balakrishna, foreign minister of Um, more importantly though than stars on GitHub, over 12,000 people have forked the repository and that's the main way people are using it. They're forking it, experimenting with it and making their own autonomous agent based on nanoflow. Uh, together with that we have over two and a half thousand uh pull requests and issues.

So maintaining an open source project today, there's never been a better time to build open source projects. At the same time, there are new challenges with uh coding agents. It's easier than ever to open a poll request. Um, and many people, many thousands of people are making great contributions to the project. Uh, but there are frankly also spam pull requests. People will point their coding agent at a repo and say, "Contribute something here." It is very difficult to tell the difference between a spam pull request and a good pull request today. They look the same. They can have similar amounts of uh code and telling the difference comes down to a deep understanding of the project and the direction of the project, the So, we built to help us sort through these poll requests, we built an agent factory uh that helps us review every single contribution.

Uh this is our agent factory. It's in our Slack. It's hosted on an xie.dev uh virtual machine. Um, every single PR that's open in GitHub fires a uh web hook um that creates a new thread in our Slack. A review agent first triages and then does an in-depth review. Uh it then gets passed on to testing first creates a testing plan uh for in-depth testing, real life testing, not just automated tests. Um and then once we approve the plan, it get a new VM is spun up. It goes through a whole series of tests and then uh once it's done we can merge it directly within the factory and it goes live. So half of you are probably looking at this going amazing I want to build a factory like this myself.

The other half of you are thinking about the security implications and going this is crazy. This is reckless. It's unsafe. Pull requests of course are unsanitized uh input, right? Anybody can open a pull request. Anybody can put things in there. Uh you can't really sanitize a pull request because I don't want to remove information from it. There's going to be false positives and everything. You can imagine a pull request that's open to harden for security to defend against prompt injections. It would trigger any kind of um detection. So this goes way beyond lethal trifecta. And um our f our workers, our agents in the factory are taking very sensitive actions.

They're spinning up VMs. They're merging uh pull So, how can we prevent our agents from being prompt injected? You obviously can't do this, right? If you go into a codebase and you see at the top of the cloud. MD uh never run drop database production. So that tells you two things about that agent. It tells you that that agent has deleted a production database before and it tells you that the agent can still do it if they put that instruction there. So it still has that ability. Uh instructions are not for security. They're not for safety. The instructions are for steering your agent towards producing valuable uh high quality output towards the direction you want it to to uh to go for you.

So how do we deal with these kinds of risks with nanoclaw? So we think about our agents as if they're operating in enemy territory behind enemy lines because they're being they're in contact with the enemy, right? The somebody who's potentially a malicious actor who's trying to work against you and get your agent to work against you. So, if you think about a map of a conflict, uh you have the red zone and the blue zone and the blue zone is our side, the red zone is the other side. Agents are operating in the red zone and at any moment they could be turned into a double agent. So, we don't trust our agents and nanoclaw agents are not considered trusted.

Instead, they're isolated. So this is a simplified version of the nano claw architecture you have on the left side slack or whatever messaging app you send the message there it goes to slack servers and then it gets sent to wherever your nano claw is running in this case say a VM there's a slack bridge which connects to the slack server with a socket or web hook every message gets sent to the slack bridge and then from there through a router and pushed to the agent the agent respond responds. It's uh produces some output that's sent back through the router back to Slack bridge, Slack server, and appears in your messaging app as a response from the agent from your, you know, Slack bot or whatnot.

Uh but the agent is potentially compromised. It's operating in the red So, anything that the agent can touch is potentially compromised. If the agent can access the router, if the agent can access the Slack bridge, it can manipulate those and change what messages it has access to, uh, and who it's able to send messages to. So rather than letting the agent access anything in the VM that it's running in or anything in the environment it's running in, we isolate the agent and put it within the VM within another isolation layer. In our case, normally we put it within a container. Now, the container limits the blast radius.

We control what goes in, what goes out, and what happens with the things that are coming out. So, the agent isn't directly connected to a messaging channel. That already does a lot to limit the blast radius, but in order for our agent to access the outside world, uh it needs to have credentials. If it wants to connect to services, whether it's GitHub or um or your calendar, and that could be using CLIs, APIs, MTPs, it doesn't matter. It needs some form of credential. So the second principle, first principle isolation. Second principle is keep credentials outside of the agent environment. The agent's environment is enemy territory.

You don't want to put anything that's highly sensitive in there. Definitely not uh secrets and The only way to ensure the agent will not leak a credential, it can't be done through instructions. It can't really be done through uh DLP or analyzing outputs. The agent can circumvent that as well. The only way to prevent it from leaking a secret is to not give it a secret. So the way that we let it talk to external services that are credentialed without giving it credentials, we add between the agents request a proxy. We give the agent a vault. We partnered with a really great open source project on this uh called one CLI.

Every request leaving the agent sandbox is proxied through the vault and then we check the request and decide if we should add credentials. The request leaves the vault with no credentials with literally authorization bearer placeholder. Literally the word placeholder at the vault. The placeholder is replaced with a real credentials if the agent is supposed to have access to that resource. But isolating the agent and giving it this proxied credentials isn't enough because if someone is talking directly with my agent, even if my agent doesn't hold the key, if it can take the sensitive action and you can manipulate it and prompt inject it, you can get it to take sensitive actions for you.

So maybe you can't get my GitHub access token, but you can potentially get it to add you as a code owner. So we need to have another layer of policies, not just rubber stamping letting every request through, but adding policies of what the agent can and cannot access. The most flexible policy for the most sensitive actions is human in the loop approval. And what that looks like is at the level uh where we're enforcing policies, we can have a policy set. This requires human approval. A request is then sent not from the agent but from the vault or from the uh the router or the delivery part of uh That message gets sent through the router to the Slack bridge and shows up in your messaging app as a permission request from the agent.

Now this is actually an illus illusion. This was in the video before and it looks like the agent is requesting your approval and then you give the agent your approval and then it goes ahead and merges your PR for you. None of that is happening. The agent can't request approval and the agent doesn't actually have credentials to merge. Instead, the agent is trying to make a request using an MCP where it writes out the the command that it wants to run in GH with the GitHub CLI. And then we display that to you as if it's a message, as if it's a request coming from the agent, but that's actually coming from the Nano host process.

Once you approve, the merge is actually done not at the level of the agent, but outside of the agent's environment. And that same uh pattern can be used to do any kind of sensitive action. Initiate a wire transfer for example. The most sensitive actions you need to separate the tool call from the tool execution. The tool call happens within the agent's environment. Within the red zone, it leaves the red zone and outside of the agents environment, you then enforce policies and implement the action if it meets your policies, including human approval. One interesting pattern that's emerged that we found in our agent factory is that we have multiple different people over reviewing and h uh providing oversight over the reviews, the plans, uh the triage.

Whoever presses the button to approve or to send it to testing, it uses their credentials. So you won't see in our GitHub any PRs being merged by a nano claw agent. I'm the one who presses the button. Means I'm proving that this is correct. I'm taking responsibility for it and it's done with my credentials. So this is uh what our factory looks like. Another interesting thing is so you can see here we have the uh slack app uh connects to the slack bridge. We have multiple different bots and then each of those bots get routed to a different nano agent. Each nano agent runs in its own container. So nano claw by default by design is multi- aent and can be multi-user multi-tenant.

Now when the test plan is approved that isn't running automated testing what happens is we have a test uh orchestrator that creates a new VM checks out the branch for that GitHub pull request in the VM. Our test agent then SSHes into the VM, runs the Nano instance and starts poking and proddding the agent, sending them a message in Telegram, getting a response, real life testing, and then also is able to check databases and logs to verify behind the scenes that what you're expecting to happen does happen. Uh, another last interesting pattern is that each of the agents in the Slack thread has a persistent environment and a persistent session.

You can come to them at any time, tag any one of the different agents. We have a testing agent, reviewing agent, uh, and give them direction, ask a follow-up question, uh, change the depth of testing like you see here. We also have this ability to tag a supervisor and give feedback. Uh, you feel a little bit like a Karen, if anyone knows the meme, can I talk to your supervisor? You leave feedback and then um the supervisor can suggest changes to the instructions and to the skills based on that feedback and then once we approve those changes they get implemented. So our factory is improving itself essentially.

So as promised that QR code if you scan it I have my agent in Telegram. It has access to my emails to my calendar uh to my drive. Uh but I feel safe giving you all access because this agent doesn't have any credentials in its environment. It's isolated. I control what goes into its environment and what comes out. And there are human approvals on every action. So that's connected to my calendar. I'll be here all day. I'd love to grab coffee with some people building interesting things in the space. Uh, talk to it. I told it to be a little bit protective over my time. I hope it's not mean. Um, but if you talk to it and tell it what you're building, uh, hopefully it will schedule a coffee chat for you with me.

Thank you. &gt;&gt; All right. Um I'm super excited to introduce uh our next speaker. Uh this is Tibo who is the head of codeex at OpenAI. Now Tibo uh unfortunately could not make it in person today. Uh but he wanted to do the talk because it means a lot for him. So he will explain uh when he when it's uh when he's up on the screen which I think he's there. Uh but the other thing that we'll be doing which is super cool is uh Tibo is excited to speak with some of the students for a Q&amp;A. So, uh, let's give, uh, TBO a warm welcome. &gt;&gt; Hi,

Codex across the software lifecycle, agent reviews, and approval fatigue

Thibault Sottiaux (OpenAI)

Summary

  • Thibault Sottiaux, head of Codex at OpenAI, reframes SDLC as a throughput problem: agentic coding widened the build segment of the pipe, so the binding constraint moved to planning, review, validation, CI, security, deploy, debugging, and learning.
  • Codex is positioned as an everything-agent across that full lifecycle, with a model-plus-harness co-design philosophy.
  • He walks the GPT-5.x line: 5.1-codex-max trained end-to-end with RL for compaction across multiple context windows (delegating to itself), 30% fewer thinking tokens; 5.2 added cyber and vision; 5.3 faster; 5.4 1M context; 5.5 added computer use, set new SOTA on SWE-Bench Pro and Terminal-Bench.
  • Codex revenue post-5 grew 2x faster than any prior release; 4M weekly actives heading to 5M; three-nines availability; ~one model per month cadence.
  • The headline operational primitive is Auto Review: a second agent verifies the main agent's actions against original task intent, blocks suspicious/high-risk steps, redirects the main agent — now default inside OpenAI and reportedly reduced approvals 20x, directly attacking approval fatigue as the agent-scaling bottleneck.
  • Distribution proof-points: Sea Limited ("C Limited") rolled out org-wide with a regional Codex hackathon in Singapore on June 6, and Nvidia rolled out to 45,000 employees in two weeks (Codex deployed itself).
  • The harness is fully open source; remote control via the ChatGPT app lets Codex run on Raspberry Pi / Mac Mini / laptop and be driven from phone, and OpenCode was rewritten on the Codex foundation.
  • Roadmap: Chronicle (screen-following memory in research preview), massive multi-agent systems as a new scaling paradigm, new tool-handling primitives.
  • On the harness-vs-model question (Dehan): they fix in the model if next-train is <2 months out, else patch the harness — explicit tension that avoids ossifying scaffolding into infra.

Why this matters

OpenAI publicly committing to model+harness co-design with Auto Review (an agent verifying another agent's actions against original intent) cutting approvals 20x is the clearest signal yet that approval fatigue, not model capability, is the binding constraint on agent scaling. The operational proof points (Nvidia rolling Codex to 45K employees in two weeks, Sea Limited org-wide, three-nines availability, one model per month) plus the explicit harness-vs-model tradeoff rule (fix in model if next-train is <2 months, else patch the harness) tell builders that durable value lives in the scaffolding agents stand on, not the weights they call.

Slides (21)

Slide notes

  • [t=89:45] Slide: "Codex expands the pipe end-to-end". Diagram compares three rows across SDLC stages (columns: Plan, Build, Review, Deploy). Row 1 "Pre-AI" — all four stage boxes empty (small uniform pixel/dot bands between). Row 2 "Agentic Coding" — only the Build box is filled (cyan), others empty; dot bands appear denser/larger around Build indicating concentrated activity. Row 3 "Codex SDLC" (with Codex logo) — all four stage boxes filled in graduated blue/purple, dot bands dense across the entire pipeline. Visual thesis: pre-AI tools touched no stage agentically; current agentic coding only accelerates Build; Codex SDLC covers Plan/Build/Review/Deploy end-to-end. Footer: OpenAI.
  • [t=99:03] Slide: "Codex for (almost) everything". Arrow-bulleted feature list: In-app browser — Testing and automating; Multimodality; Automations; No-projects chats; Memories; Research; Artifacts; Plugins; Commenting mode. Right side shows a Codex app screenshot titled "Research coffee import costs" with a generated "Paris Coffee Import Brief" document (Coffee Importation & Cost Brief for a Paris Coffee Shop) including Executive Takeaways, Pricing Ranges table (coffee categories Green/Roasted with EUR/kg ranges), Duties and VAT table (HS/TARIC, Duty), French Import VAT note (1 January 2025), Market Snapshot, Compliance Watchlist, and a Limited Cost Model with Layer/Include/Notes columns. A menu mock-up sits alongside.
  • [t=94:24] Slide: "Codex — 99.98% uptime" with green check icon; horizontal bar chart of daily uptime bars (mostly green, a few amber). Tagline: "Building at the scale our customers need". Footer: OpenAI.
  • [t=104:51] Q&A slide (blue background): "QUESTION FOR TIBO — Louis". Question text: "The DevEx on the Codex app is the best I've seen: project organisation, the diff view, one-tap PRs. It has genuinely changed how I build. As agents get more capable and the user base broadens beyond developers, how do you think about the interface layer? Chat feels like a default we inherited from LLMs, but is it actually the right model for how humans should work with agents long-term? What does that evolution look like to you?" Page 2 indicator.

Full transcript (3703 words)

Everyone. Um, glad to be here. I would have loved to be in person. It's just really incredibly exciting to see the room so packed. Uh, Singapore has such a unique energy and I'm excited to chat to you all from San Francisco. Feel really proud to say that San Franc Singapore is actually the top in the top five countries globally for codex adoption and engagement. Uh it rose there fast. Uh it feels like Singapore is just adopting new technologies and an unprecedented rate. Uh our mission overall is to deliver the benefits of AGI to all of humanity. And I believe that in the coming months we'll make such incredible progress towards making AI deeply valuable to each and everyone in the world.

We started with Chat GPT and with Codex we have focused on builders and developers. You might know Codex as this little app, but for us it is our frontier agent. And I'm going to talk a little bit about what agents have done to software development and the whole life cycle. I don't have to tell this room, but software development obviously is unrecognizable compared to two years ago, even six months ago. New models are capable of full agentic delegation or examples like we saw with nanoclaw where you have a full autonomous system just doing stuff for you uh going far beyond programming. You just give it a job.

It works on the task the codebase perhaps for hours independently sometimes a full day until the job is done. From the beginning, that's been our goal to build an AI teammate you can delegate A useful way to think about the SDLC and building things is to think about it as a throughput problem. For decades, the software development life cycle was designed around one core Code is hard to write. That assumption shaped really everything around it. We planned heavily because engineering time was scarce. We reviewed every line carefully because code was expensive to get wrong. We built delivery systems around the idea that the build step was the narrowest part of the pipe.

Aentic coding has really changed this It has dramatically widened the belt section of the pipe. But if the rest stays narrow, the total throughput does not actually increase. The constraint moves into the systems around the build step planning reviewing validation CI security release operations debugging and even learning and understanding of what's actually which is a big part of the new bottleneck. That shift is something that everyone here needs to understand. The opportunity was not just to generate more code faster, but it is to redesign the entirety of how we do engineering and how we can increase the overall throughput of what we deliver together.

The first wave of AI coding just really expanded this build phase. We were all very excited to be able to write a lot of code faster. That matters. It means engineers can generate, modify, and test at a speed never seen before. But as we said earlier, just widening the build section does not increase total throughput. The next step was really to look at expanding capacity across the full software delivery life cycle. This is how we think about codeex, our agent. It's not just a coding assistant, but an agent that can work across the full layer of building software. In the build step, critics can help engineers delegate implementation work.

In review, critics can help inspect changes, surface issues, support human review. In deploy and operations, cloud agents and automations can help teams respond to triggers, investigate issues, and keep work moving through the system at unprecedented pace. The goal is not to remove humans from the process. The goal is really to make every stage more scalable. So higher code output can actually become more shipped value. This is a key distinction. Agentic coding increases code velocity but agents like codeex help organizations expand the system around that velocity. So there are like these different steps here and we can see that you can use an agent to increase the velocity of planning, velocity of building, velocity of reviewing and even the velocity of deploying.

If you think about it, planning, building and reviewing is a little bit easier because you don't really have a side effect upon the world. And deploy is when you know security starts to really matter as you're having an actual impact and the code gets actually deployed out there and meets your users where they are. We have automations for this. We allow to build around the agency as well. And then we have a version of our cloud agents um which has secure can have secure access through our plug-in system and allow you to deploy and verify that the deploys are correct through human This is a journey we started a long time ago.

The Codex team is special in a sense that we designed both the agent and the models to power those agents and we work deeply within the research in order to advance the state-of-the-art for our models. This started with a model GPT51codex max which is now famous for its name which we released at the end of 2025. It was trained on end to end RL for compaction for longunning tasks. This means that in its environment during RL, we were exercising tasks that would challenge the model to work well beyond its context window. And at the end of its context window, it would need to delegate to itself in order to achieve a task across many context windows of inference.

We also shipped a high reasoning effort. We trained it to operate natively on Windows and we showed that we could achieve better performance with 30% fewer thinking tokens and achieving a new state-of-the-art token efficiency. This is a theme that will continue and that we've seen across every other model ship. The token efficiency just gets better and better and better, which both makes it faster and cheaper to run agents over time. With 52, we increased cyber security capabilities, which is really a precursor to what we've seen now with models with unprecedented unprecedented capabilities around cyber.

We've improved performance across large code changes, but also we added vision capabilities. We're not just building a texttoext model. We're building an everything agent. With 53, we made it faster. With 54, we added 1 million context window. And 55 has been our biggest step change so far. Even though it seems at face value, it's only a little incremental 0.1 uh improvement from 54 to 55. It was actually a much bigger change. We added computer use and we made it even more token efficient. It is really the smartest and fastest model available out there today. What makes it work though? What makes it work though is not just the model.

It is a combination of the model and its hardness. This is why Codex is special. We're capable of like co-designing these things and making the harness really optimized for the model and the model optimized for the harness. It allows us to deliver a new class of intelligence very broadly and very efficiently. Five was released only a few weeks ago and we saw revenue grow more than two times faster than any prior release. People really love it. We've seen adoption really go wild. You can see here that it set a new industry high on SweetBench Pro. We also achieved a new soda on terminal bench. It seems like we're just pushing the frontier model after model, model after model, and we're now shipping at a cadence of roughly one model a month.

We did all this while also delivering unprecedented And this was no no short feat. Really the level of engineering and infrastructure improvements that we needed to deliver here were started a year ago which allowed us to scale with unprecedented demand. Usage exploded. We're serving 55 at a level of traffic that um makes me fail at times. Like we have like an amazing team of engineers and own callers and it's also one thing that is rarely talked about is how efficient our models are and this allows us to offer like just really generous limits across the We've achieved like nine nine three nines of availability which I'm very proud of.

Um all while scaling and being used across hundreds of companies. We've now over four million and approaching soon 5 million weekly active users. And it's never been a better time to get started. A lot of engineers write more code. We've already talked about that. But what we haven't talked about is that inside of OpenAI just really everyone, everyone I see, everyone I talk to uses Codeex for literally everything, not just for engineering. We're seeing the marketing department use it. We're seeing finance raising uh rounds of incredible fundraisers using Codex to coordinate it all. It has become this everything agent.

And because we were building Codex using Codex, we've never built faster. We've released an extraordinary amount of features this year. Team configuration, new models, Codex for Windows. The Codex app itself is only 3 months old, which still shocks me when I think about it. We've released fast mode. We released auto review as well, which is one of my favorite features. When you think about agents and security and safety, one thing that is often overlooked is that approvals and human approvals are something that leads to fatigue and mistakes over time. If you have to go and like verify everything that your agent is doing and like thinking hard about whether you want to approve it or not, then you're bound to make a mistake at some point and give it too much access or allow it to do something or merge a PR or worse send some information off somewhere that you shouldn't have done.

This is going to be true as we continue to scale and you have many more agents working for you. Auto review is a new system which introduces a second agent which verifies the actions of the first agent and it verifies them against the original intent of your task. So if you say hey go and check my important emails for example and pull the last three that you know are specific to the goals that I have set today. Then auto review will understand that this is your intent and verify each action from the main agent against that intent. Anything that is suspicious or high risk and other line with that intent will get blocked and the main agent will get redirected to try and do something else.

This is very important because it allows you to preserve the human attention and not fatigue you with unnecessary approvals. This is the default now within OpenAI and it has reduced approvals by a factor of 20. We're seeing gains across the company across much more than coding. There's a bunch of pillars that we're investing automating more deeper enterprise controls, leading on models and the overall developer experience. I'm really proud of how polished the application has been that we have shipped and how delightful the experience is. I invite you to all try it. It's really a different way to interact with agents and over time we're going to evolve it into the cockpit for every agent that you manage.

CODC unlocks so much for builders but also for almost everything. We're seeing really incredible use cases for even nontechnical people. This is Rowan's mom just experiencing the magic of image genen 2 for the first time in and she's longtime recruiter. She really needed to do a bunch of things across uh managing her resumes and she wanted to go back into recruiting. We showed her codeex and she just immediately got it. There's new ways of interacting with agents that is really going to come to everyone. We don't think agents are just for technical people. There are different challenges when you think about bringing agents to the world where you just really need to preserve the magic while also making it safe and secure.

But we think this is going to come to the world very soon and it's not going to be just enabling engineers and technical people to become more We're linking our agents to the entire world. We have plugins for almost everything. We also are working on memory systems. We're working on new models. You can set up automation so they can run on specific on a specific schedule, maybe every few hours to give you a report. And really what we're starting to see is that the models are so reliable at doing complex tasks that it's just really a question of what is the context and what is the access that you give those models.

And this is really what is capping the potential right now. It's like how much access to the world these models have. We're seeing great success um in different areas. Let me make this concrete with one example of one of the world's most advanced engineering C Limited, one of APAC's latest, largest digital platforms and a major open customer. C has gone allin with Codex. It's rolled it out across its entire developer organization and its chief product officer shared with us that Codex just goes beyond coding and feels pretty magical. We're excited to have the first regional Codex hackathons here at C starting on the 6th of June.

Right here in Singapore. I'd love for you all to join and do check it out online. We rolled it out to also 45,000 employees across Nvidia. Um, we did it in only two weeks. Codex helped itself with the deployment within Nvidia and this is a trend that we're seeing. We're just using agents to accelerate everything including the deployment and development of Codex What's special about Codex is that it's it's entirely open source. You can read the code of the harness uh just on GitHub. It's under the the Codex repo. Uh you can also bring it anywhere. We now just released remote control through the chat GBT app.

So you can have it run on a Raspberry Pi, you can have it run on a Mac Mini, you can have it run on your laptop and then fully control it over a secure connection uh straight from your app. You can also, the thing that's pretty magical that I love doing is using the plugins for browser use or computer use and allow it to just use and navigate across your computer, but like use this little command, this little remote control that you just have on your phone. And I think this is something that we will soon realize is that agents will have a certain permanence to it and we will just really start to consider them to be like these little entities in the cloud that we can reach from all sorts of different clients.

Be it on the web, be it through a desktop app, be it through a client. Eventually, you'll just pick up your phone and talk to your agent and it will still be able to do things for you and have access to everything in your life. We also ship fast and we we fix fast. Uh we we don't we're not shy of sometimes making mistakes and um resetting some uh rate limits when we get things wrong. One thing that's cool as well is Peter is working with me. He's the original creator of OpenClaw. We also support this as an open source project. We recently worked on rewriting the core of open claw to be based on the same foundation as codeex.

So it actually runs the codex agent under the hood. You can read about it uh on on the open source repo. Again like all of this code is open source and we really want to contribute to like this new generation of inventions by just showing how you can do these things in a simple way. Um we're taking safety first. We're also thinking a lot about security. We innovated on Windows sandboxing. We publish a lot about this on our blog posts. You can read all about the Windows sandboxing there. And we're trying to solve the hard problems at the product layer as well. And in the future, we hope to bring agents to the scale of all of Chadbt, which almost now has a billion users.

There's a lot of things that I'm excited about, but here are some that we're really working on hard. We're working on new memory systems. We shipped Chronicle which is an experimental in research preview which allows your agent to just follow everything that you've done on your screen and form memories from it so that you know it knows what you did last week. It knows what you did it did you did during the day and it gets a lot more contextual. We're working on massive multi- aent systems with the next generation of models. We think this will be quite groundbreaking and a new way in a new scaling paradigm.

Um, and then we're working on new ways of handling tools which I'm excited to share on more in the future. I heard that several of the billers in this room wanted um to ask a few questions and unfortunately I'm not able to hear the questions live but we've compiled a few of the questions and I would love to go through two of them. Here's a question from Louis. The DevX on the Codex app is the best I've seen. Project organization, the div one tap PRs. It has changed how I build. As agents get more capable and the user base broadens beyond developers, how do you think about the interface layer? Chat feels like a default we inherited from LLMs.

It is actually the right model for how humans should work with agents long term. What does that evolution looks like to you? I think this is very interesting and initially we just really inherited this thing where we were powering LLMs through LLMs who were powering chat conversational interfaces and chat GBT started that revolution and what we're seeing is now that LLMs can do things on your behalf and get access to everything. We have to evolve how we think about these things. And it's just really going to profoundly change, I think, the way that we interact with computers, with technology. And I hope it sort of frees us from some of the limitations that we have, I think, collectively found as well where we're always glued on our phone, you know, kind of tucked over um you know, maybe we're like typing furiously on our laptop and sort of like we're not connecting enough to others.

I think the future is going to be a future where people are much more connected and everything is much more ambient and seamless around you and you can interact with technology through natural language through natural voice in a very multimodal way and it sort of fluidly adapts to what you want to do in that moment. And this is hard to imagine now, but I think, you know, within a year or so, we're going to start seeing like the signs of that where agents get embodied and things get a lot more natural. Just you continue to leverage all of this through natural voice. Um, we're going to break apart the boundaries of the applications that exist today on your Dehan asked, "You've said some scaffolding should disappear as the model gets better, but skill seems like a kind of user-owned scaffolding that should maybe stay.

When somebody something fails, how do you decide whether to fix it in the model to harness a skill or somewhere else without accidentally turning today's model limitations into tomorrow's infra?" This is something that we think a lot about and it's something that is unique to our setup where we have control over the model. We have control over the harness and the product and the agent primitives. Oftentimes we actually ask ourselves hey what if we don't fix this in the harness today how quickly is it going to be possible to improve the models this is something that you know for example for end to end compaction and end toend RL and compaction for very longunning tasks before that people were trying to sort of like fix this with like manual compaction and very complex systems to keep state around we were thinking maybe we can fix this by working very hard on the next model train and just being able to keep that coherence around very long horizons of tasks.

Um, and so we fix it at the model. Sometimes we estimate that it takes more than a couple of months to fix it in the next generation of models and then we decide to take a bit of a shortcut uh and fix it in the harness instead. And so there's always this healthy tension, but we're able to co-design things and just really approach things from first principles uh which always makes me very excited to think about these problems. There's a few more questions, but uh I think I'm running a little bit short of time. And I just wanted to thank you all for being here. Um and I invite you to just all think with this technology and like you know think about you know what the future is going to look like and you know to invite it into your lives.

It's here to stay. It's going to continue to evolve. It's a beautiful time to just explore all of these things and I hope you have a wonderful time building. I would like to invite to the stage Dr. Fun Yang, head of AI practice at Govek. Um, good morning everyone.

AI Native Government: Singapore's Sovereign Agentic Harness

Dr Feng Yuzhang, Head of AI, GovTech Singapore

Summary

  • Dr Feng Yuzhang (head of AI, GovTech SG) frames Singapore's shift from 'AI-enabled' (bolted onto legacy, incremental, scales but doesn't compound) to 'AI-native' (AI as foundation, reimagined from ground up, continuously compounding).
  • Deployed cases: pre-dementia detection (Nature Communications, rolling out at community sites this year); AI-assisted teacher marking saves 3-4 hours per class; MyCareersFuture recommendation engine; citizen call-centre LLM stack cut after-call work 72%.
  • AI-native strategy = four persona pillars + one horizontal: (1) augment all 150,000 public officers — personally training permanent secretaries on building agents in two weeks; (2) citizen developers (non-technical policy/engagement officers) vibe-coding prototypes; (3) engineers across full SDLC using Claude Code and Codex; (4) AI-driven domain transformation in education, transport, healthcare, plus HR/finance.
  • Horizontal Government AI Stack: foundation models plus vision, speech, doc analysis, evals, safety — all government-context tuned.
  • Centrepiece: a sovereign agentic harness — MCP gateway (front door), agentic runtime sandbox, agent identity, agent memory (short and long-term), observability, versioned/governed Skills platform. Every coding agent, co-work session, or workflow agent in government is a client of this one stack ('one door, everything visible').
  • Cites IDC 1.3B agents by 2028 forecast; calls it conservative.
  • Closes with car analogy — models are engines but you need bodies, roads, traffic rules; the harness is the moat.

Why this matters

A sovereign government publicly architecting an agentic harness MCP gateway, runtime sandbox, agent identity, short/long-term memory, observability, and a versioned Skills platform as the substrate for 150,000 public officers is a real-world stress test of the 'harness as moat' thesis at national scale. GovTech's car analogy (models are engines, you need bodies, roads, and traffic rules) plus the IDC '1.3B agents by 2028' anchor reframes the sovereign-AI conversation away from training frontier models and toward governed deployment infrastructure; Skills being named a first-class primitive is a strong tell about where Anthropic-aligned tooling is headed.

Slides (17)

Slide notes

  • [t=116:45] Slide 7 'What does AI-native government mean?' — 4 Pillars: (01) Every officer augmented — 150,000+ officers need AI embedded in daily tools/workflows; (02) Citizen developers building — non-technical officers closest to problem prototype and deploy; (03) Engineering with AI leverage — SWEs build with greater speed/quality, compressing dev cycles; (04) Transformation and modernisation — completely transforming domain/function, redesigning processes for better outcome.
  • [t=116:45] PlatformAI underpins all initiatives: 40+ Models (same-day launch, managed throughput & costs, proprietary & open weight) + 30+ custom-built AI Capabilities (Vision, Speech, Documents, Text, Evals, Safety) + Agent Harness (orchestration loops, memory, tool connections, observability, safety and security).
  • [t=121:21] Slide 10 Core metaphor: 'Super engines alone are not enough. You need robust cars, established roads and clear traffic rules for safe, efficient commutes.' Blueprint diagram labels: HARNESS (add brakes here, later adaptation, custom build — do not grind), MODEL (test launch & decel), CONTEXT (fuel lines, high octane only, refill port), GOVERNED LANES, PLATFORM (load-bearing substrate, maintained centrally), substrate components listed: observability, memory, identity, MCP gateway — all maintained centrally.
  • [t=121:21] Architecture mapping: Model = engine (swappable); Harness = car chassis (agent scaffolding, brakes/safety, custom build per use case); Platform = road (centrally maintained load-bearing substrate of observability + memory + identity + MCP gateway); Context = fuel (governed quality input); Governed Lanes = traffic rules / policy guardrails.
  • [t=112:10] Slide 5 'Some examples of AI use cases in govt' — 5 apps: Pensieve (Healthcare, AI tablet screens seniors for cognitive impairment via drawing analysis, replaces 30-min pen-and-paper tests, rolling out community sites 2026); AI Guardian (AI Safety — Litmus = Testing-as-a-Service red-teams AI apps before launch; Sentinel = Guardrails-as-a-Service detecting unsafe/irrelevant content in production LLM apps); Mark.ly (Education, AI grading assistant, 82% accuracy, cuts 3-4 hrs marking per class, surfaces data-driven student insights); JumpStart (Recommendation, reusable AI recommendation platform — personalised job/course/pathway recs at scale with shared MLOps); Transcribe (Service Delivery, auto-transcribes contact centre calls, summaries + CRM analytics, supports ServiceSG hotlines so operators focus on callers vs note-taking).

Full transcript (1912 words)

Uh, my name is Yang. I lead the AI team at GFE Singapore. Very glad to be here today um at the AI engineer Singapore event to share with you uh how we are driving the adoption of AI in Singapore government. Wrong clicker. Yes. So a very quick introduction of GFT in case you are not familiar. Uh Gtech is the lead agency driving Singapore's uh smart nation initiative and public sector digital um transformation. We harness the power of technology to deliver digital government services. I'm sure some of you actually many of you have actually used some gap products such as Syncpus live SG go business etc.

Our mission is really to engineer a digital government and make life better. Um in fact GVtech was formed uh in in the year 2016 and this year we are celebrating 10 years of tech for public Coming back to AI, it is clear that the government must adopt AI. Um the first and most immediate reason is obviously for effectiveness and efficiency. Our government is responsible for delivering services that millions of people depend on every day. AI give us the opportunity to actually do this much faster, more accurately and at greater scale. So that is an opportunity we cannot afford to miss. But beyond the operational gains, there is the there's the question of expectation from the citizens and businesses.

As new technology reshape how citizens live and how businesses operate, people increasingly expect the government to keep pace with the This will increase the trust and from the people the government serves. There's also a deeper reason to govern well in the digital in the digital world. We need to understand the technology shaping it. Hands-on experience with AI builds the intuition necessary to craft the policy that are thoughtful, grounded, and fit for purpose. Pro protecting our citizens while enabling innovation. And finally, if we want our entire nation to embrace AI like how our prime minister has say so the government must not must must not sit on the sideline.

We must lead by example. When citizens see their government using AI responsibly and effectively, it builds the confidence and sets the tone for the whole of society. In fact, we are not starting from scratch. For years, our government has already been using AI in many areas to inform policy and improve operations and service delivery both internally within the agencies as well as externally serving citizens and businesses. Just to share a few examples uh among the tons of AI use cases that we have implemented in government in healthcare AI has been developed to detect early signs of pre-dementia.

This technology achieve a very high level of accuracy and the results were published in the scientific journal nature communications. We are actually rolling out this technology at community sites this year. In education, AI has been deployed to assist teachers mark assignments faster with higher accuracy, cutting down three to four hours of marking per class and allowing teachers more time to engage the students. For jobs and skill. We uh our recommendation engine has been powering my career's future in delivering personalized job and course for Singaporeans and residents to find more suitable jobs faster and also learn new skills more effectively.

For citizen services, we have developed and and deployed the latest AI model to our citizen call centers. The transcription, summarization, and analytics capabilities allow us to serve our citizens better, reducing the afterpaw works by 72% and improving customer satisfaction to At the same time, we also make sure we apply AI responsibly by developing safety testing tools and guardrails to ensure our AI solutions are safe, secure, and behaving in the intended While we have made significant progress in bring AI into the government uh over the last few years, we aspire to actually go even further from being AI enabled to becoming an AI native government.

So what is the difference you may ask? An AI enabled government uses AI as a tool, a helpful a helpful addition to the existing processes. It is usually built on top of legacy systems and is there is increment incremental improvement. The system scales but not compounds. On the contrary, an AI native government is something far more ambitious. It means AI is the foundation and the core of everything. We reimagine how government works from the ground up with AI embedded in the way we think. Design and deliver there is always continuous innovation. So what does AI native government means to us exactly and how we are working towards that?

We think about it in four pillars differentiated by user personas and one horizontal. Let me let me just quickly walk you Firstly, we want every single public officer to be augmented by AI. All 150,000s of them from ground staff all the way to the prime minister. No I think just now Minister Vivian talked about how he uses and builds AI. Uh in two in two weeks time I'm going to conduct a technical hands-on training to a room full of permanent secretaries on building agents. We really want to put the AI productivity tools into the hands of every single public officer to help them with their daily tasks and workflows such as drafting, summarizing, transcribing, analyzing and etc.

Second, we want citizen developers to be building with AI. These are basically the non-technical officers who are closest to the problem statements that we are interested in. They can be policy officers, you know, they can be uh citizen engagement officers, they can be product managers or designers. We want to provide the tools to them for them to be able to vibe code, create prototypes and deploy them. Personally, I feel this is a gamecher because it will change the entire innovation model within the and now without relying on the engineers, people can actually bring their ideas to life early in the stage.

Thirdly, for software engineers, AI allows them to build production grade application with greater speed and compressing the entire software development life cycle. We have already roll out many um various AI coding assistants like clock code, codeex to our developers. This is not just to help them with the activity of coding but also the entire SDLC such as code review, testing and Last pillar is about AI for domain and uh domain transformation and We want to focus on a few key domains such as education, transportation and healthcare and the cross cutting functions such as HR and finance and completely redesign the business processes for better outcome.

You will see underpinning all these AI initiatives is our government AI stack which really provides the latest foundation models and those customuilt AI capabilities in vision, speech, document analysis, evals and safety all with government context and This will ensure our AI solutions are supported by performant models, have shorter time to market and also are safe and secure by design. As part of the platform, we are also building capabilities in agent harness. Let me spend a few minutes to explain what it is and why we are doing it. Looking ahead, we understand from the industry that you know AI agents is going to proliferate very soon.

This will mean AI becoming more capable having access to data, having access to tools, being able to perform actions in autonomous fashion. According to a IDC study, there will be more than 1.3 billion AI agents by the year 2028. Is a very big and scary number but personally I find this could be actually very conservative by the rate of development that we we can we can already see people starting to develop agents either for their personal use for the team's collaboration or even consumption at the at the enterprise there's a whole range of use cases for in the government for agent AI in citizen services policy research etc With the proliferation of AI agents in the government, we must we must think about a way to effectively enable them, optimize them and manage them so that we can maximize the value and manage the any associated risks that comes with we are building a sovereign agentic harness which include a few components.

MCP gateway which serves as a front door, agentic runtime which provides a sandbox environment and also the resources for the agents to perform their actions. A agent identity uh management which ensure each agent has a verified identity knows what is allowed to do and cannot overstep its An agent memory which provides personalized experience to users with short-term memory within a single conversation and the long-term memory across multiple sessions. Observability is important. It provides the oversight to the entire agentic ecosystem monitoring what agents have been doing catching problem early and understanding what has gone wrong.

A skills platform which contain a rich library of readymade capabilities like searching the web, reading documents, sending emails all versioned, evaluated, sharable and governed so that agents can draw upon to complete their tasks. The idea is that every single assistant or agents in the government whether it's a coding agents or it's a co-work session or it's a workflow agent is a client of this stack. One door everything is visible. You might think that it's relatively simple to think about it, you know, in an individual local setup, but it is a whole different ball game at the enterprise level, especially if you think about it, you know, in an ecosystem across multiple organizations within the government.

As an analogy, I always like to think about it, you know, in the car example, a super powerful car engine by themselves are not good enough to actually bring people from one point to another. You need robust car car bodies. You literally need the the roads. You also need the clear traffic rules, you know, for safe and efficient commute. Similarly, AI models are like car engines. They are not good enough to be effective agents. They need a harness to be truly useful and trustworthy. Hence, a key strategy for us to work towards the agent AI is actually to invest heavily in building these capabilities in the agent harness.

That leads to the end of my sharing. Thank you very much for your attention. It is really exciting time. It is really exciting time ahead of us. Uh do collaborate with us. Um and also you know if you interested joining us in this meaningful journey AI for public good if you are interested please visit our booth you know we have teams showcasing some of the work the initiative the projects that we are working on they will be more than happy to share more details with you uh I'll also be very happy to connect with you on LinkedIn and also share more about collaboration opportunities. Thank you very much.

All right, I would like to invite to the stage our first speaker in our design track, Phil, CEO and co-founder of Air Also, quick PSA. Um, it is past 10:00 a.m. So, our expose are actually all all open as well in Pullman as well as uh Capitol Kinsky. If you need to refer to any maps, we have some tools for that. Thank you. To just straight in. &gt;&gt; Testing. Testing. &gt;&gt; awesome. Good morning everybody.

Design intent, human taste, and keeping AI products from flattening into sameness

Phil Hedayatnia, co-founder, Airfoil

Summary

  • Phil Hedayatnia (Airfoil, SF/SG design firm) frames the existential question for design firms: do designers have a job in two years as code-gen models commoditize UI output?
  • Airfoil Labs' answer: build harness tooling around the model, not compete with it. They shipped three internal tools: Check (image-model diff between Figma and live staging that auto-feeds defects back into a code model for self-improving sites), Scoop (turns 2-3 page client briefs into 50+ page context packs), and now Melt (the headline product, undemoed before).
  • Thesis: training on design outcomes yields slop-mitigation (impeccable.style, contrast/typography rules) but not novelty.
  • Real design is applied psychology and Eagleman/Brandt-style 'bending, breaking, blending' — process artifacts that vanish when you only ingest the final asset.
  • Melt captures the upstream: designers save inspiration (Twitter, in-the-wild photos, restaurant menus in Vietnam) to a 'backpack' with auto-extracted metadata (typography, palette, geo, company background) plus human annotations.
  • Multiplayer: teammates' comments become retrieval signal for natural-language mood board queries. Blend (next layer, not yet shipped) remixes saved references plus comments to render full page mockups — claimed to beat raw Claude/GPT.
  • Endgame: legible-to-LLM taste corpus enables generative UI that follows human intent rather than averaging the training set.

Why this matters

Airfoil's answer to the existential question facing design firms build the harness around the model, not compete with it is the same orchestration-beats-model pattern showing up in a non-code vertical. Melt's design is structurally a domain-specific Skill/MCP for taste: ingest unstructured human reasoning (inspiration, annotations, multiplayer comments), make it legible to an LLM, capture defensible workflow data. The 'process artifacts as moat' insight that the value isn't the output, it's the captured reasoning ladder behind it generalizes to any domain where training on outcomes alone yields slop.

Slides (46)

Slide notes

  • [t=126:27] Impeccable product mock ([impeccable.style](http://impeccable.style)): "Design fluency for AI harnesses." Tagline — "Great design prompts require design vocabulary. Most people don't have it. Impeccable teaches your AI deep design knowledge and gives you 23 commands to steer the result." Nav: Home / Designing / Docs / Slop / Live, 27k GitHub stars. What's included: Impeccable agent skill with 23 design commands; optional CLI + Chrome extension. GET STARTED CTA. "Works with" row of logos (Anthropic, Claude, Google, OpenAI, GitHub, Cursor, plus others). v3.1.0 note: Codex asset producer agent, critique persistence, palette-first image flow. Right-side Before/After demo card: generic AI output vs. "with design skills" — annotated with PURPLE GRADIENT and CARDS ON CARDS callouts on a draggable slider.
  • [t=129:50] Transition slide on magenta background: "How do we solve this problem?" — Airfoil wordmark bottom-right. Sets up the product reveal that follows.
  • [t=133:13] Airfoil canvas UI — "Perplexity Computer Landing Page — Lineage" view, Top-Down layout. Parent node "Perplexity ..." tagged capture-page (screenshot of Perplexity Computer Automates landing). Three child nodes connected by "remix" edges: remix-3.png, remix-2.png, remix-1.png — each a stylistic variant of the captured page ("Computer Automates" / "Compute Automates" / "The Computer Creates"). Demonstrates capture-then-remix lineage / version-graph workflow as Airfoil's design-fluency tool in action.

Full transcript (2210 words)

It's great to great to see all of you here and honestly so surreal um to think that this entire conference is happening that so many of you have traveled from around the world and coming from Singapore to be here. Um I'm my name is Phil Hedatnea. I'm the co-founder of a company called Airfoil. Um we're basically a combination of a product design, brand design, and design research firm that works with companies across the tech sector. Um, but we've actually been dual based in San Francisco and in Singapore for the last 5 years. So, it's awesome to see you all here. Um, whether or not you know who we are, uh, you may have interacted with some products you've worked on in the past.

So, for example, if you're doing document processing with agents, you might be using Reduct. If you're embedding voice AIs into your application, maybe Vappy. If you're doing a Gentic Search, maybe Exa. Is there Oh, someone in the back is okay. Uh, or if you're here from crypto, maybe Salana. Um, but I wanted to basically about a year ago we built a team at Airflow called Airflow Labs because there was a question on all of our minds and the question was very very simple. Will we have a job in two years? Because as a design firm, right, especially if you've been on Twitter and seen the talk about the design tax and the way in which increasingly improving models will enable us to just build things without the need for a designer.

We were honestly a little scared. Um we wanted to know where our place really was in the design process. So we started building. We made things internally like check which is our own uh engine to effectively QA the implementation of our design. So we can take a Figma file on one side, we can take a live staging site on the other, use image models to then compare the two together and make sure that we've implemented them properly. Um eventually this turned into something kind of cool which was self-improving websites. So, because we're able to stack rank and uh prioritize based on severity, we're able to then feed that directly back into a code model and then constantly make sites better even after we've released the first dev version.

We built something called scoop, which is um effectively it just takes all the information that a client gives us and turns it into a really comprehensive brief. Takes two or three pages of context we get and turns that into 50 or more. But it importantly gives designers more context on the industries they're designing for, the customers and users they're designing for so that they can do better work. But after all of that, we started to put our heads together around what's effectively the holy grail, the thing that everyone's trying to solve. How can we create design agents that actually have taste that are able to produce things that don't just look like slop?

And so today, I want to present a little bit of what we've learned. Here it is. Okay. Not that. Not that at all. Um, that's actually a screenshot from impeccable.style. Um, which is something that you can download. We didn't make it, but it helps uh your agents have better design fluency. The way that it works is it basically tells agents a bunch of things not to do, right? Make sure that your color contrast is appropriate or use better typography. And that does make a meaningful difference. You can see without impeccable.style and with impeccable.style style that site looks a lot better, but it still kind of looks like slop.

It looks like something that you were able to just generate directly. So why is that? Why does that continue to happen? Well, our view is that training AI on what we consider good design doesn't teach AI how we got there. And it misses a very important point. Design isn't about taking product specs to Figamox. Design is about applied psychology. It's about understanding how the user thinks, how the user acts, and crafting the flows, visuals, and narratives that will resonate with them. I like to say that designers are investigators of human psychology. This is a mood board that my co-founder put together for a merch project that we're working on.

And it actually seems kind of random at first. If you look up in the top left corner, you'll see a photo of California Street in San Francisco. And it's unclear what that has to do with merch. But what it really means is it's a way for us to categorize the things that we derive meaning from. These images may seem random at first, but they express meaning to someone. And when designers put together these mood boards, that they're trying to understand. They're trying to investigate why people resonate with certain things, come up with rules for how to do that, and then apply that to their own work.

And there's another way to look at this. It's just human creativity. Uh there's a book called The Runaway Species by Anthony Bran and David Eagleman. And Tony Bran actually uh is a professor at Rice University where I went. I studied under him. He was one of the biggest inspirations for me and one of the reasons I got into design. And what the runaway species articulates is a definition of human creativity which is the bending, breaking, and blending of existing concepts to create things novel relative to the culture in which they're introduced. Put simply, it's not that people are born with creativity, that they're they have an inate characteristic to be creative.

We all are creative every day. It's a simple part of of how our brains work. But it's not just a neuroscientific definition. It's a sociological definition. We see this in things like biomimicry. The reason that the shinkenzen doesn't cause a sonic boom as it exits the exits a tunnel and through a mountain is because they modeled the shinken not just uh off of other trains but off of the bill of the kingfisher. That was the insight they derived from nature and applied to a totally different context and even in something like a website like say the reducto site that we worked on um we wanted to make it feel friendlier and more accessible to people.

So we introduced page elements that brought back dot mat the the elements of dot matrix printers. You can see an actual example over here. It's little decisions like that that make all the difference to making interfaces and brands look great and makes the difference between stuff that looks like slop and stuff that looks innate and and really creative. But my key point is that none of that can be extracted from outcomes. You can train on outcomes and you'll eventually get overall better visuals that don't make clear mistakes, but you won't get visuals that are novel, interesting, and new. When we train models on ideal design outcomes without the context and thinking behind them, that's when we get underwhelming results.

So, we decided to take a stab at how to solve this problem. And I'm going to show this to you for the first time. We've not demoed this before. Um, this is something that's currently internal to us, but we hope to bring it to the public soon. Um, I want to give you a first look today at something we built called Melt. So, Melt starts where a lot of our designers do, which is design Twitter. Um, but this is the case for a lot of designers, right? They're always going out into the world. They're finding inspiration. They're looking at an interesting brand direction, and now they can just save it to melt.

They can click the save to Meld button, and then we save it away to what we call their backpack. Or let's say they were on a trip to Vietnam, and they went to a restaurant called Pizza Four Pas, and they're like, "This is a pizza restaurant, but it's really beautiful brand direction." And it is great. Uh illustration like their menus are even gorgeous. They can just take that thing that they saw, they can save it directly to Mel and then we start to extract key metadata from that things like typography, color use, but also background information on the company itself and where you were when you took it.

Once we have all that metadata together, we're able and that's what the desktop one looks like. We're able to put that in your backpack and you can access all of that information later. So, why would you want to have all of that information? Well, the first reason as a designer is you want to use it the way you would use a notebook. By annotating on the samples that you save, you're able to record your thinking at the time, which means that in six months time, if you need to come back to any of that, you can do so immediately. But it also means we can start to make connections between the metadata we're collecting and how it impacts how people think about it, how it impacts their perception.

So this means that I can ask more complex queries. For example, I can say find visuals from Vietnam 2026 with serif typography and blue, yellow, or black and white color palettes. The second half of that you might be able to just do directly with an image model, but the combined query gets easier to do once we have all of that in there. And so you can see it says find three saves in Vietnam 2026. And it's able to understand in a bit deeper of a way what it's actually referencing. And of course, you can oneclick export to Figma. That's the thing every designer wants. Where it gets more interesting is that once Mel understands your reasoning for why you've saved things, it can surface that to other people and it can give you the ability to share in a more multiplayer way.

So on Mel, you can make general queries that then are able to use the comments and annotations other people have left to better understand the content you want to find. So it's a much more effective content finding engine. So if I say something like assemble a mood board with light and airy UI, it's able to find references not just that I saved but that my teammates also saved with context about why they saved it which is really really important. Coming back to that definition of creativity again, our view is that by enhancing recall and putting things in front of people faster, but also also making sure to save that creative process, to save the feedback, the back and forth, the comments that uh make uh work resonate with people, understanding it in a deeper way.

That's what enables us to then take the next step, which is actually try to infuse quote unquote taste or rather the intention of the human designer into the work that these models actually produce. So this isn't part of the product yet. It's something we built called Blend, but it's able to use both the visual references in Mel that you've saved as well as the metadata and comments to remix different things together. We're now building tools that enable us to render entire mockups of pages just using the influences in Meld coupled with custom prompts and commands. And though it's still a bit rough and we've got some more work to do on it, it is yielding way better results for us than just using claude or GPT directly.

Der Rams once said that you cannot understand good design if you don't understand people because design is made for people. I think the problem with design agents today is that we're spending a lot of time looking at what people make and not looking at why they make it. But by bringing all of that onto a platform and making it legible to LLMs, I believe that will unlock the next generation of design agents that are able to act more intelligently to follow our intent and even to make decisions on their own. We want to start with augmenting the creative process. Eventually, that enables us to teach machines to create.

Eventually, that enables us to teach machines to decide. That unlocks the world of generative UI and all of the amazing futures we want to build. So, we're going to be sharing a lot more in public soon on MEL and everything else that we do at AirFoil. You can scan the QR on screen to learn a little bit more about us or stay in touch. And Min and I will both be around the conference today and tomorrow. Thanks so much everybody. And now I would like to invite up to the stage Annie Lua, senior UX researcher at Hi

The Friction Worth Keeping: Designing AI for Subjective Decisions in Shopping & Travel

Annie Luo, UX Researcher, Google (AI Shopping)

Summary

  • Annie Luo (UX researcher, Google AI Shopping) argues first-wave AI optimized for utility (summarize, book, complete) but a new class of consumer AI is fundamentally subjective — fashion, travel, identity-laden decisions where the user doesn't know what they want until they've seen contrast. Removing friction here destroys the value.
  • Reframe: design for 'the deciding,' not 'the decision.' Taste, trust, and confidence are built through the process, not handed over at the end.
  • Google virtual try-on (custom image-gen model, live US/AU/ID): renders a jacket on the user's own photo so they build taste through comparison; even a no-purchase outcome is valuable because the user sharpened preferences.
  • Google Travel maps interface: wandering — not committing — is the feature; a chatbot would have committed you to skiing five prompts ago, a map lets you change your mind.
  • Metric shift: away from conversion/time-to-result toward 'felt more confident,' 'learned something about themselves,' 'came back to explore.'
  • Three design rules for subjective AI: (1) support comparison, don't single-shot a recommendation; (2) build intent-expression moments before answering; (3) invite active selection, don't auto-best-pick.
  • Thesis: there are three optimization moments where efficiency kills exploration, and those are exactly where friction must be put back — a consumer-AI counterweight to the coding-agent efficiency narrative dominating the rest of the conference.

Why this matters

A counterweight to the coding-agent efficiency narrative: when users don't know what they want until they see contrast (fashion, travel, identity-laden decisions), removing friction destroys the product. Annie Luo's reframe design for 'the deciding,' not 'the decision,' with metrics like 'felt more confident' replacing conversion is a reusable lens for any consumer AI surface where over-automation erodes the trust moments high-stakes decisions actually require. The same logic applies to payments UX, financial advice, and any agent that mediates judgment rather than execution.

Slides (9)

Slide notes

  • Slide: 'Subjective decisions need a different design instinct' — 3-column framework: 01 Taste ('People don't know what they want until they explore'), 02 Trust ('People trust the system when it shows, in small moments, that it gets them'), 03 Confidence ('People feel confident when they own the process, not just the outcome'). Sources: Preference construction (Lichtenstein & Slovic, 2006); Relational agents (Bickmore & Cassell, 2001) [t=136:53]
  • Virtual try-on UI demo: three phone mockups side-by-side showing same model in different outfits (burgundy dress, cream long coat, brown jacket + jeans) on a park background; each frame has Shop / share / save / refresh chips. Headline quote: 'Not a fan, but I learn something about myself' — supports the 'taste' / exploration thesis [t=139:17]
  • Slide: 'The friction worth putting back' — 'Three optimizations where efficiency costs exploration': (1) Support comparison, not one suggestion -\> Showing one match bypasses taste building; (2) Understand intent, not just quick results -\> Skipping the intake limits trust; (3) Invite active selection, not auto-match -\> Auto-selecting 'best match' undermines confidence [t=141:41]
  • Through-line: each Taste/Trust/Confidence pillar maps to one of the three 'friction worth putting back' design moves — comparison preserves taste, intake preserves trust, active selection preserves confidence [t=141:41]

Full transcript (1460 words)

Everyone, I'm Annie. I'm UX researcher at Google working on AI shopping. We've heard a lot about um coding agents and ways to get AI to do more faster with less friction. And I want to talk about the other side, a class of problems where efficiency isn't the goal and we actually need to keep some of the friction in for these everyday consumer AI products. So let's take a moment to think about this question. One that you might ask in front of a mirror. How do I look in this Underneath though, um, you might actually be asking about, does this reflect the person I want to be? Um, does this make me feel like because the fur jacket might be a little bit out of my daily range, um, am I brave enough to wear this or does it make me feel like I'm trying too hard?

So, these aren't prompts or search queries. These are the kind of questions that people quietly ask themselves when making a purchase So the first wave of AI worked by removing a lot of these frictions um for tasks like summarizing um a dock or booking the cheapest flights. These are the utility task where the success metrics is pretty obvious. Um you get the task done quickly and as AI is now being asked to help with a different class of problems where um the question is a lot more subjective like how do I look in this jacket? What kind of trip do I want? Um, these are subjective questions and the right answer depends on the person, the moment and even the mood and efficiency alone can't really tell if the feature is actually helpful anymore.

And so how do we design for this? When AI moves into helping people with these everyday decisions that are really personal and subjective, three things change. People don't actually know what they want until they've seen a range of versions in contrast. And that's how people build trust. And as AI becomes a thinking partner for a lot of these decisions that are a lot more personal, a different kind of trust has to be earned. If you think about um meeting a stylist for the first time, trust is built through the small talks that you guys had up front or the stylist commenting on something that you wear that day um instead of upfront giving you a range of recommendations for things that you like.

You wouldn't trust that the stylist actually know what you want. And so um it's really important because uh you trust them because they have signal in small ways through those little interactions that they understand your vibe and different from utility tasks where confidence for personal decisions um comes from the feeling like you have made the call and all of these aren't straightforward deliverables. These are things that AI has to help you build in the process. Um and so in the next few slides I love to use two domains to show how that looks like. Um in fashion and travel. First, this is um virtual tryon.

It's a Google shopping AI feature that I've been working on um for visualizing how clothes would look on you. Powered by a custom image generation model for fashion. We launched it last year in the US and in APAC. It's currently available for users in Australia, Indonesia, and And here's how it works. You're looking at a denim jacket and you upload a full body photo of yourself. So, I pick one of me in Central Park in New York and um and then the AI can render the jacket on you in your context rather than you having to imagine how that would look on you browsing a feed of products. And notice that the question AI helps with is not just figuring out is this is is this is like a nice jacket.

It's actually helping you visualize do I look good in this for a vibe check. And you can also see yourself in different jackets. Maybe I want to try a white one. And that's how you gradually build taste by seeing a range and compare. You don't really know you prefer the white one a lot more by um until you actually see this being right next to a blue one. And as you explore further, you might start to recognize patterns about yourself or find something that really surprises you. Maybe the the brown one actually looks really good. And AI that really built for supporting subjective decisions aren't really deciding for you, but they're giving you a surface to to discover your own taste.

And in this case, let's say I'm not really interested in any of these. I feel like I'm not a fan. Um but in in the utility frame, it feels like nothing happened here because um the user didn't buy. But subjectively, they got something super valuable because they sharpened their taste. I also learned something about myself. I don't really look in that um purple dress, which is equally valuable. In our next example, um let's take a look at travel as well, like where should I travel next? Again, the real questions underneath are subjective. Um, do I want to be challenged or just chill out and relax?

Or in this trip, do I want to be a museum person or do I want to be a beach person? A booking agent won't be able to help you answer that. And people plan trips partly to figure that out. And in Google Travel, we treat maps as a place to wander, not just a destination picker. And this is a reference point for the kind of interface that supports exploration, not just jumping to book me a ski trip for efficiency. And maybe you wonder, should I be a ski person this winter and you wanted to explore um Aspen or um Whistler and both are great skiing destinations in the US. Or maybe skiing doesn't feel right and now you're considering a totally different kind of trip.

Um, and so maybe you wanted to explore like the Yellowstone National Park or um, Yusede and now you're considering something totally different and a chatbot might have committed you to skiing five prompts ago, but a map interface lets you change your mind and explore with you. And that's a key difference. So, what both products have in common is this. They're not trying to um, give you the answers fast. They're trying to give you a better place to think. And that's why it's really important to design for the deciding not just a decision because um taste, trust and confidence, these are built through the process, not just um being handed over for you at the end.

And so that also means we have to measure a different set of things for metrics like task completion, um time to results, conversion. These are great for utility tasks. But for a different um class of problems that are a lot more subjectives, the ones that truly matters are harder to count. For things like the users feel more confident, did they learn something about myelves? Or did they return to explore more? These are the things that really count. And practically there are three kinds of optimizations where efficiency could cause explorations. And those are the moments where it's really important for us to put the friction back.

Um, and so for everyday consumer products where AI is supporting people with personal subjective decisions, it's really important for us to support comparison, not just giving one suggestions right away. Otherwise, we miss that meaningful moments to help people build trust. It's also really important to understand the intent not just um give them the quick results because we have to earn a different kinds of trust for and build in moments where people can express their intent or visual preferences and also show that AI gets your taste and gets the style that you want instead of assuming the intent right away.

Um and lastly, invite active selections. Um not just automatically giving you the best choice because the act of choosing is the point a lot of these exploration journeys that really make the journey fun, delightful. Um that's about self-discovery as well. And these are the frictions that are worth keeping. Um thank you. I love jamming on consumer products and I also write about this kind of thing at Substack. Happy to chat more after. All right, thank you so much everybody. Uh that was the conclusion of our first part of our morning session.

Framework Ergonomics for Agent-First Developers: Docs, Errors, and the Cost of Ownership

Jimmy Lai, Vercel (Lead, Next.js + React team)

Summary

  • Jimmy Lai (lead, Next.js + React at Vercel) frames the 2026 shift: framework maintainers no longer design for the person at the keyboard but for the agent sitting between them and the code.
  • Next.js weekly downloads went 4M to 42M in four years, largely on coding agents reading docs; 60% of Next.js docs are now served as markdown because agents are the primary consumer.
  • Cost curve has inverted: creation is cheap, ownership is expensive. A stale doc or misleading example used to confuse one developer; now it taints hundreds of thousands of projects and can poison training data.
  • Compiler errors that say 'open an issue' are dead weight — agents never file bug reports. Agents are unforgiving QA for API design (no Stack Overflow lore to paper over bad naming).
  • Internal rule: stop chatting with 10 agents in parallel. Invest upfront in prompts, evals, and safeguards so you can later scale to 100 background agents instead of bottlenecking on your own steering.
  • Macro warning: forking frameworks or rebuilding stack layers because AI made it cheap is a trap — see his React/Next.js shell vuln response — because attackers will find you and security researchers won't. Open source matters more, not less.
  • Closing thesis: the question stops being 'can we build this' (always yes) and becomes 'should this exist and do we want to own it long-term.' Use AI to delete as much as to ship.

Why this matters

When the lead maintainer of Next.js says 60% of his docs are now served as markdown because agents not humans are the primary readers, the surface area of 'developer experience' has officially shifted. Jimmy Lai's inverted cost curve (creation cheap, ownership expensive), his rule to invest in prompts/evals upfront so you can scale to 100 background agents instead of bottlenecking on 10, and his warning that forking frameworks because AI made it cheap is a balance-sheet trap are the practical operating principles for the agent-first era. The moat is the orchestration surface docs, errors, CLIs, explicit semantics not the underlying engine.

Slides (10)

Slide notes

  • [t=172:21] MY PREDICTIONS (3-panel): 01 We now build FOR agents \| 02 We now build WITH agents \| 03 Learn to pick what NOT to own — the closing thesis of the talk.
  • [t=166:28] "Documentation is the gospel for agents." Subtext (monospace): "And every stale doc in your own codebase is a confident bad answer waiting to happen." — stale docs =\> confidently wrong agent output.
  • [t=178:14] Closing slide: "▲ Ship what's next." (Vercel triangle) — tagline tying back to Jimmy's title line "Ship what's next."
  • Segment context (per brief): Next.js download graph cited at 4M → 42M downloads (agent-driven adoption); docs-as-markdown shipped so agents can consume framework knowledge directly; React/Next.js shell vulnerability case used as the example for "pick what not to own" — lean on framework maintainers for security-critical primitives rather than reimplementing them.

Full transcript (3453 words)

Hi, I'm Jimmy. I lead the NexJS and React team at Versell. So, since we're at at an AI conference, I'm kind of curious just, you know, how much people are familiar with like Nex.js and Versel in general. Okay, it's not not terrible. Then um for those who don't know what it is, it's like a web framework that people use to build like websites in general. It's not the topic of the talk, so we should be fine. Um but here's a fun number. When I joined NexJS four years ago, we were doing about like 4 million donors per week, and today we're basically up 42 million. Uh obviously a huge part of that is thanks to you know the incredible work that my team is doing but in reality uh in reality I think a huge part of that is because of like coding agents.

Um and as such I think that has changed a lot about like how we think about building um tooling for developers. Um because we spend a lot of time about obsessing about how people will build websites in the future and in the past six months it has caught up to us you know faster than we ever expected. Um, and this year we spent a lot of time talking about like, you know, how can we account for like this new type of like user? How do we stay ahead of the curve as a team? And like, do we still have a job in the future? And we're somewhat okay because I think maybe what we had to go through uh and realize is that like the job was never just about executing the tasks.

It's about deciding which task should even exist and whether or not we want to own the result of it. So yeah, in reality AI just has made like executing and building u much faster, but turns out you still have the same amount of time in your day. So in reality, it just makes ownership more expensive because you still need to account for like what you put out uh out there and you still need to like deal with like any issues that come with it. Um so today I want to share a few of those lessons with you because I think they apply to everyone in the room whether or not you're building with an agent for an agent or whether or not you're building your own agent.

Um my prediction is what we're learning about how agents use Nex.js JS will only grow more useful as agents become more widespread as um you start to using to use them for like anything else besides coding. Uh maybe it's just about like doing your online shopping for you as we already kind of see happen with like open flow. Uh so I want to talk about three things like just what's changed for us as we started building for agents. How agents also changed how we worked as a team and where I think the industry is going to go um and especially why open source matters more than ever now. So the weird thing about maintaining a framework in 2026 is that you're no longer designing for the person at the keyboard.

Uh you're designing for whatever is sitting between them and the code. So it used to be an editor the docs pages but now it's like an agent that you know has access to your whole computer maybe has too much permissions and that can be like a little bit dangerous and pricey too. So fun fact like 60% of like the next JS docs today are served as markdown. Um which means like not only coding agents but it's also like um like lab indexing um etc. But it just means that like we're not having much uh like manual clicks to the docs anymore. I mean if you think about it when was the last time you consulted docs yourself um it's always a bit faster.

There's less less friction now that just just in just asking like you know claude like how does this thing work in XJS? Um, and so we're moving to a world where software is kind of becoming the primary user of software. And I feel like that changes a If you think about it, an agent is kind of a an annoying kind of user. It reads exactly what you've wrote, copies example, it runs commands, it follows errors to the to the to the letter really. And so if the loop is broken, if your errors are not good, it's just not gonna, you know, like sort of stop, get some coffee, go to bed, and like wake up at 5 a.m.

With the fix in their head. They're just going to keep trying and burning money up until uh they get it fixed. And so that has been really useful for us to understand because it indicates where the human is still like required in the loop and what we should try to optimize. Um, so for example, documentation used to be like somewhat of a very passive thing. We we used to put it out there and we'd assume that like no one would like people would read it once in a while and like sort like build that knowledge over time and and you know I always prided myself on like having great docs for like the next site but the real return on investment was never really there.

We do it once in a while like we'd look at them and like say oh we're missing this feature. Um but nowadays like documentation has become like you know a bit of a a bit of a bible for agents. They'll take whatever you wrote in there and just act on it immediately. And it's not only applicable you know to sort like frameworks and like dev tools offer. It's like anything that's in your codebase like your readmies your PR your your playbooks every stale document that's in your codebase. It's kind of like a time bomb right? It's a it's like a hallucination waiting to happen. And the most dangerous ones that I've seen are like not the missing ones, right?

Like because if the agent has access to code, they can still figure it out. It it's like when the doc is like slightly misleading. Um what we've seen in practice is a stell example used to confuse, you know, one person at a time and now basically confuses like hundreds of thousands of projects before anyone really notices it. Um and worse for us it's when the bad information gets into the the data sets it means that the model is now potentially tainted. Um that definitely has happened for some NexJS features. Um and the same story for compiler errors. We we have sort like an error I think in next that says like this is a bug in NexJS please open an issue.

And this is like this kind of like a crime uh in this age because I've never seen to this day like an agent like open a bug report on on Nex.js. U like as as tool offers it's really important for us that we kind of ensure that the user always stays you know as fast as as unburdened as possible. In general this again this applies to like any types of like tools that you build um like you know like your banking websites that requires like 10 steps before you're able to like um send a payment or something. Um, so but it's just that like agents have made this much more important for us. Yeah, it's also a great way of like agents are like a nice way of like testing out whether or not you're designing properly because when you're coming up with like something new, an agent will basically have like no, you know, they don't have like Stack Overflow.

They don't have like Twitter lore to understand what what something is. And so if your if your API is like poorly designed, poorly named, then you're going to in for a bit of trouble, right? And one key part that I've learned that I think you should also apply in your work is that like any system should be as explicit as possible. When you're doing an action, you should really be thorough about like the way you can explain it. Like like those like you know 500 like status code that don't say anything you should be able to still debug it because we're kind of heading toward a world where um you know as like Sio was saying earlier about codex is that like agents are just going to run passively for everyone and so you want to enable your own systems to run and be fixed passively.

You want to you want them to be able to understand like you know if your if your prerogative is that like all of your websites should be fast you should be able to like define those and today we have like some metrics here and there but maybe you have a different definitions and so you should work through your codebase what that means what does it mean to be reliable? What does it mean to be um fast? What does it mean to be secure? Um so that in turn you're well set up for when the agents are able to like kind of run autonomously and ship the fixes Um so yeah that's that's kind of the first shift like how basically building for agents doesn't like replace the fundamentals it just makes them like you know even more important.

I wanted to share about how we leverage agents ourselves internally, right? Like I'm sure like you're all familiar like in the past six months um I think the industry has gotten like a little bit of like a a psychosis phase where like everyone was trying to build everything in the entire world. Uh I've certainly done so a lot thanks to uh oppus during Christmas and once we got past there uh I think like the realization was that the real work is you know the tastes and the judgment and and I think it's better to think of agents as like being able to help you for everything around it. Um as an engineer what's most valuable is my focus time.

Like I'm sure you've heard about this study that says, you know, if you're like disturbed for um a little bit, it takes like you it takes you like 30 minutes on average to get back to like a flow state. And in a world where it's, you know, extremely tempting to have like 10 agents uh running in the background, uh 10 chats open at the same time, like you know, how do you make sense of that? How do you stay productive in that world? Um because agents are still pretty powerful. They allow you to like, you know, research very quickly. Can turn like a messy investigation into like, you know, a really nice documentation spec.

Um, but my key advice here, learned from my my own hard experience, uh, is stop actually force yourself to stop chatting with them. You kind of want to put in the work now so that you can avoid having to like over steer an agent. Like it's kind of hard because I like I like having like 10 chats going on at the same time. It gives me like a little bit of like dopamine all the time. But think about the word where instead of like you know bottlenecking yourself with like 10 agents you can by putting the work now putting like the correct prompts in place and the correct like evolves and like safeguards.

You can this is kind of how you scale yourself to like having a hundred of like agents running in the background. Um yeah because that's just what's going to happen for us as an Um, so that's the version of like AI that I'm very excited about, right? Um, but there's also another part to it where like the industry does not only reward judgment, right, but it also rewards like motion. Uh, and EA AI creates a lot of that. So I want to talk about the honestly that's probably the the most important thing. It's like knowing when not to use AI. Uh, someone said to me in the past six months that the the last six months have felt like the most exhausting five years of their life.

Um, and that sounds kind of about right. Every week there's like a new model, a new demo, a new feature that comes out. And you know, my natural reflex, and I'm sure you've felt this as well, is that like, well, you should basically do the same thing. You should like generate more code. You should ship like more features um so that you can bid out the competitors um so that you can stay ahead of the curve. But in reality and and we know this from like you know having like built like developer toolings for like the the past 10 years is that what's going to happen is that you're just like speedrunning tech depth.

You're every demos that you have shipped you know in the past six months maybe you've felt it now. Um now you actually have to deal with it and with that comes a lot of like other problems like uh observability um pricing uh making sure that like everything stays like really reliable. Um, and for example, like you know, you can fork NexJS over a weekend with like uh, you know, a bunch of like tokens. Uh, but just because you can doesn't mean that you should. Uh, because when you fork a framework for example, you start owning everything that comes with it. And like maybe the the biggest recent example of that is security issues.

Um last year I ended up like leading the response to like react to shell which was um you know a very critical issue and like a very critical vulnerability and this is kind of what I think this is kind of like the extreme of like ownership is like we said you know we released um NexJS to the world like a few a few years ago we keep releasing it and all of a sudden I still we still had to fix it um for like you know the the hundred like you know thousands of like users that we have and this is where I you know would caution you to if you think that like you can like to fully replace abstract away some place of your stack um think about yeah like how how are people going to deal with it in a year's time when you've created your own like meta framework um to serve your websites do you actually want to take care of this or maybe it's better to use like open source and make sure that like you're you know giving back to the community and potentially like helping other people have like more secure websites.

A note by the way on security, right, is we're kind of living in like unprecedented unprecedented precedented times um in terms of like uh vulnerabilities being disclosed every every month or so, right? Because AI has made it insanely easier and it looks like you're not using like secure software, but it's actually the reverse. It means the system is working. If you're getting like security patches, it means like somebody reported it. Um the other alternative here is that you're building your own version of like your framework or your tooling. Um but they don't get like the attention the attention of like security researchers.

And so now attackers will just like you know identify that you're you're running your own your own stack and they will like basically attack you without you knowing. And um this is where open source really matters because we get to sort like build on like stable foundations together. Um so at Verscell we kind of have a saying which is like you you know you can just ship things and this this was pretty great like we built like you know an insane amount of like really good products out of this. Um but since the start of the year we've also started taking like the other approach too which which is the fact that like you can also use AI to just delete things.

Um because before like shipping was you know it just meant winning like you just you could know could have like features features features um in your road map and but now that's like it has become so cheap I think what's going to put you ahead of the conversation um and the competition is focus like because you're going to have to deal with it your users are going to have to deal with it it's I would actually rather slow down take you you know, take time to reflect on like what what is actually what makes your product different. Um, because if you can build like a feature in like an hour with a few tokens, your competitors can do the same, too.

So, and yeah, and what I'm saying is not about like not shipping at all, right? It's more about like how you protect yourself from like and how you protect your ability to keep shipping. Um, the question is not yeah, can we build this anymore? Because the answer is like always yes. The harder question and the one that like actually decides whether or not something is worth doing is like should this exist and are we actually willing to own it like long term. Um the sun yeah when I was working at Meta we had a thing called like the not invented here syndrome uh where people would just actually rebuild like every every library is possible on the planet.

Um there there used to be like um people are pretty familiar with like React Native for example for serving web um mobile mobile apps with React. Um funnily there's like three versions of this internally at Meta just because people did not want to control this and that was already a problem back then. Um, it's becoming more and more of a problem now for like everyone. And again, uh, when you think about like spinning up your own, uh, so like your own products that like replaces like something in your stack, think about yeah, the the mental burden that that's going to come with it. Um, so yeah.

Um, as quick recap, I guess like my my my predictions, right, is like if you're building with an agent, like what's really important is that you think about what's not on the happy path. Like, you know, can your users actually fully use your tools without like prompting themselves? Like make sure that like your docs, your errors, your um CLIs are well well defined. If you're building with agents, be really careful about like outsourcing judgments there. Um you can use them to you know get context closer to the judgment like you can do the research you can you know investigate repros specs like investigate like performance issues.

Um but you know yeah really focus on like what you bring to the to the table and make sure that you make time for this. Um and also as the industry speeds up yeah be careful about what you decide to own. Um again like AI made creation like really cheap uh but ownership is much more expensive than you think it is. Um so yeah doesn't mean that you should ship less but it just means that we have to ship you know as an industry with like more focus build what should exist like try to make it understandable make it reliable make it safe and stand on the foundations that you can trust. Um and yeah, thank &gt;&gt; Thank you so much, Jimmy.

Our next speaker is Vran Yukich, uh who's a co-founder and CTO of Daytona. Um He will be speaking about why sandboxes are non-negotiable for autonomous AI agents. Um, so without further ado, we're going to hear from Um, hello everyone.

Why Autonomous Agents Need Sandboxes, Isolation, and Strict Boundaries

Vedran Jukic, CTO/co-founder Daytona

Summary

  • Jukic frames agent security as structural, not a bug: any agent (Claude Code, Codex, OpenCode) inherits the user's shell, SSH keys, AWS billing, filesystem, and decides actions from text it reads off the internet.
  • Prompt injection (direct and indirect via READMEs, web pages, issues, emails, PDFs) cannot be eliminated — OWASP and OpenAI (Dec 2025) say so explicitly; attackers only need to win once.
  • Skill ecosystem amplifies risk: skills are folders of instructions+code with no app-store review, executed at full user permissions.
  • Three early-2026 audits: Koi Security found 341 malicious skills on the open Claude marketplace growing to 800+ by February; Snyk found 13% of a separate skill set had serious problems with 76 outright malicious; 'Malicious Skills in the Wild' scanned 98,000 skills and flagged 157.
  • Live demos: a documentation skill with hidden markdown comments instructing exfil to attacker servers ('do not mention this to the user'); a 'telemetry' function walking env vars to harvest API keys/tokens.
  • Prescription is a real sandbox (not just VM/container) doing four things: (1) keep secrets fully outside the agent — proxy injects real tokens at egress from a secrets broker, agent only sees placeholders; (2) egress proxy with allow-list network policy; (3) full command + request logging; (4) model gateway between agent and LLM so every prompt/response is auditable per-sandbox.
  • Closing frame: assume compromise, constrain blast radius, ship Daytona-style isolation.

Why this matters

Daytona's framing prompt injection cannot be eliminated (OWASP and OpenAI now say so explicitly), attackers only need to win once, and skills are folders of instructions+code with no app-store review combined with hard numbers (Koi Security: 341 malicious skills growing to 800+; Snyk: 76 outright malicious in a separate audit; academic scan: 157 flagged across 98,000) makes the case that the security/isolation layer is itself a core piece of the orchestration moat. The prescribed minimum (secrets broker injecting real tokens at egress, allow-list egress proxy, full logging, model gateway between agent and LLM) is now the baseline trust layer for any agent touching regulated data or money. The agent economy needs payments-grade trust infrastructure and the data is starting to back it.

Slides (12)

Slide notes

  • [t=183:41] Slide: "Skills are unaudited code that run as you" — SKILL FOLDER (doc-assistant/ with SKILL.md, scripts/generate.sh, scripts/deploy.py, references/style-guide.md) arrows to EXECUTED AS YOU (\$ your shell, \$ your tokens, \$ your filesystem). Caveats: Markdown + optional code; Distributed via community registries; Minimal vetting, no signatures. Punchline: "The agent is told to follow the skill's instructions. It does." Bottom: "No review. No signatures. No sandbox."
  • [t=186:42] Slide: "Stop trying to make the agent unhackable" — three red-X rows: Model layer — undefendable; Skill ecosystem — poisoned; Input — untrusted. Closing line: "Stop trying to fix layers you can't fix." (Frames the architectural pivot: defend the blast radius, not the agent.)
  • [t=189:42] Architecture slide: "LLM Gateway — Every sandbox routes model calls through the same gateway. Every prompt, every completion, every tool call — logged." Three SANDBOXES (sandbox-a: fix issue #482 in acme/api; sandbox-b: review PR in acme/frontend; sandbox-c: scan deps in acme/data) all flow "prompt + context" into a central LLM GATEWAY box showing per-call logs (claude-opus -\> 3 tool calls -\> 2.1k tokens; gpt-5 -\> 1 tool call -\> 800 tokens; claude-sonnet -\> 5 tool calls -\> 4.7k tokens), fanning out to MODEL PROVIDERS: Anthropic, OpenAI, Google. Footer: "Every prompt. Every completion. Tagged to its sandbox. On the record."
  • Through-line of segment: the supply-chain/skills threat data (Koi 800 malicious skills, Snyk 13% problematic, 98k/157 paper) is unwinnable at the agent/model/skill layer — answer is sandbox-level mediation: secrets broker + egress proxy + LLM gateway giving per-sandbox auditability and tagged provenance for every model interaction.

Full transcript (1245 words)

It's great to see you all here. Uh, my name is Weather Nich and I'm the CTO and co-founder at Daytona and Singapore is our number one city by users worldwide and top five countries. So thank you for for that. And today I'm going to talk about uh why you should run your autonomous agents inside sandboxed environments. So when you install cloud code, codeex, open code or any agent that uses tools, you give it a lot. It runs as you. It can read your files. It can use your SSH keys. It can spend your AWS bill. It can delete things. And it decides what to do based on text that it reads from the internet.

So we said yes to this because the productivity is real. But most of us never thought about risks. So why would the agent do things that it is not supposed to do? Well, because it can get compromised easily. Um, prompt injection is when someone hides instructions in text that agent reads. And there are two kinds. Direct when the attacker types the bad prompt and indirect when the bad prompt is hidden some in something that the agent reads. It can be a web page, a rhythm file or an email. And indirect are the dangerous ones because the autonomous agent reads the internet. It's its job. Fortunately, uh models are getting better at spotting this, but they don't actually catch them And remember that attacker needs to succeed only once.

So, OASP says, OASP says it plainly. Uh, the prompt injection cannot be fully prevented. It's how the models work. Open AI said the same thing in December. So, the people who are building the models are actually telling you they they cannot stop this. It's not a bug. It's how the technology works. And we also use skills, right? So a skill is a folder with some instructions and maybe some code in it and you download it. You give it to your agent and your agent will run it with all your permissions with your shell with your tokens with your files. And remember there is no app store review for skills.

So there's no sandbox between the skill and your machine. The agent will read instructions from the skill and will does what they they Um if you look at the numbers they're they're not looking good. Three different teams have looked at this in early 2026 and KO security checked open cloud the skill marketplace for the cloud agent. They found 341 bad skill and by February that number grew to over 800 bad skills. Sneaked checked another set and they found 13% of skills that had serious problems and with 76 were clearly malicious. Also research paper titled malicious skills in the wild checked 98,000 uh skills across different marketplaces and they found 157 of them are bad.

Now let's let's see uh an example of a real malicious skill from from that re research. And it looks like a normal documentation helper, but there's a hidden comment in the markdown, right? And if you if you preview the file, you don't see it, but the agent does. And the comment tells the agent to send your project files to an attacker server, right? So, and the funny thing, the last line in the comment says like, do not mention this to the user. Um, some skills ship real code and this one looks like a normal telemetry function, but if you look, it it collects some data and and it sends it to an analytics endpoint.

But if you look closer, uh, what it's really interested is are your API keys, your secrets, your tokens, and it will walk through your environment variables. It will pick out the credentials and will ship them out. So to get compromised, you don't actually need to install anything. Um the agent can read any readme file, an issue or an email, even a PDF and any of them can contain malicious instructions. So instead of trying to to prevent this, we should accept that this is a reality. The model cannot be fixed and OASP and Open AI said so. A new skill ecosystem is is already full of of bad stuff and new malicious skills are appearing at a rate that no one can really review them.

So any readme file, any ticket, any email the agent reads can can hold malicious instructions. So what can we do? We can we can change what the agent has access to. So no host shell, no host files, no credentials and we can restrict the internet and also we can throw it all away when the task Sandbox is not just a virtual machine or a container. The agent inside still has your access token is still has open internet. A real sandbox does four things. One, it keeps your secrets outside of the agent so that agent never sees them. Two, it controls what the agent can access on the internet or inside your local infrastructure.

Three, it logs everything, every command and every request. And four, it sits between the agent and the AI model. So you can see what the agent has asked and what the model responded. A real sandbox has the network Every outbound request goes through a proxy that checks every request against the allow list. And a request to random endpoint will get rejected. Also, everything is logged. So if something goes wrong, you can clearly see it in the logs. The agent should never see your secrets. The secrets should live outside the sandbox. So when the agent makes a request, for example, to GitHub, it sends the token placeholder value instead of the real token and the proxy will catch that on the way out and will get the real value from the secrets broker and it will send it to GitHub.

The response will come back through the proxy to the sandbox. So the agent will get what it asked for but without ever knowing the token. And if the agent gets compromised, there's there's no token to to leak because the token was never exposed to the agent. And finally, the model is the brain to the agent. If you can't see what's what what goes in and what goes out, you have no audit trail. So every sandbox routes its model calls through the same gateway and every prompt and every response gets logged in in in the gateway. So when when for example sandbox A starts behaving strangely, you don't have to to to to guess what's going on.

You can just open trace for sandbox A and you can clearly see what the agent asked and how the model responded. So the reality is that the agent will get compromised. The only question is what it can reach when it does. So you can try to build your perfect agent or you can put it in a sandbox and sleep well. So choose nice uh choose wisely. Thank you. Thank you van. Next up we have Vashant Kameeshwaran who's a co-founder of Gravile along with Rohan who's also from Grapile. They're going to be talking about what they have learned from analyzing 5 million vibecoded PRs. Um so once they are set up and ready to go, we'll have both of them talk about that.

Hello everyone.

What 5 Million Vibe-Coded PRs Reveal About Agent Bug Profiles

Vaishant Kameswaran (Co-founder/CTO) & Rohan Kumar (Researcher), Greptile

Summary

  • Greptile reviews ~4B lines of code/month for Nvidia, Coinbase, Meta and runs PR-review agents flagging ~100k bugs/day; dataset is 5M PRs.
  • Detect 'vibe-coded' PRs via three signals: bot co-author field (~1% recall), bot-authored PR description boilerplate (~20%), tool-prefixed branch names (Codex, Cursor).
  • As of April 2026, 27.6% of PRs show strong evidence of being fully agent-authored; the curve is steepening.
  • Quality tested on three metrics: revert rate, severity distribution of Greptile comments (P0 to P2 nits), rounds-of-review-to-merge. No single agent dominates — Claude and Devon merge faster, Claude reverts less; most bots produce fewer P0 critical bugs than humans on average but full distribution is mixed.
  • Sharper insight: bots produce differently shaped bugs, not strictly better/worse — Cursor background agents over-index on N+1 query errors, Claude agents over-index on missing tenant-check errors.
  • Net-net agentic coding is a win on throughput at similar (differently-shaped) quality, but validation pipelines must be re-tooled around model-specific bug fingerprints.
  • Greptile's positioning: the review/validation layer is the moat in an agentic-codegen world.

Why this matters

Greptile's dataset (5M PRs, ~4B lines/month reviewed for Nvidia, Coinbase, Meta) puts hard numbers on the agentic-coding transition: 27.6% of April 2026 PRs are fully agent-authored, the curve is steepening, and bots produce differently shaped bugs rather than strictly better or worse ones (Cursor over-indexes on N+1 queries; Claude over-indexes on missing tenant checks). The takeaway for builders: when models commoditize codegen, the durable layer is the validation harness that knows each model's bug fingerprint and 'missing tenant check' is exactly the failure mode that destroys multi-tenant SaaS and billing systems.

Slides (8)

Slide notes

  • Signal 3: Branch name prefix — branch naming convention as a fingerprint for which AI agent authored a PR [t=193:42]
  • Mock branch list (5 of 1,284) shows tool-prefixed branches highlighted: codex/add-rate-limit-middleware (1 day ago) and cursor/refactor-billing-types (3 hours ago) sit alongside human-style branches feature/payment-retry, fix/login-redirect, chore/bump-deps [t=193:42]
  • Implication: agents like Codex and Cursor leave detectable prefixes in branch names, giving Greptile a third signal (after commit/PR fingerprints) to attribute PRs to specific coding agents at scale [t=193:42]
  • Slide footer: deck position 08/25, Greptile branding; talk frame reiterates 'What We Learned From Analyzing Five Million Vibecoded PRs' [t=193:42]
  • Note: frames at t=196:41 and t=199:40 were not retrievable, so revert-rate / severity / rounds-of-review chart content for this segment could not be extracted from imagery

Full transcript (1670 words)

Uh, I'm Vishant, co-founder and CTO of Greile. &gt;&gt; And hi, I'm Rohan. I'm a researcher at &gt;&gt; And today we're going to be talking about what we learned from analyzing 5 million vibecoded PRs. So, at Reptile, we're building AI agents that review and test pull requests. We're reviewing four billion lines of code every month for companies like Nvidia, Coinbase, and Meta. And there are 100,000 bugs that are identified by Reptile and fixed every single day. AI agents have evolved a lot over the last few years. In 2023, we were still working with, you know, quite simple agents that are able to generate short code snippets for us.

In 2024, we started to see the rise of agents that are able to make small multifile changes. And since 2025, we've entered a new age of fully agentic coding. AI agents are now able to create uh to go directly from spec to PR. But this leads us to wonder, are these fully vibecoded PRs actually any good? How are they being adapted adopted by industry? And in what ways are they succeeding or failing? So we have over 5 million PRs in our database. So we're well equipped to answer this problem. Uh and the first thing we need to figure out is how you know if a PR is vioded. Uh and so we rely on three key signals to figure this out.

Uh the first is the GitHub author field. So uh often bots will just add themselves as co-authors to your commit. And this is a very surefire way to tell whether the bot uh vibe coded the PR. Uh that being said, it's a pretty sparse signal. Only about 1% of PRs in our database were able to be identified this way. And so obviously many more than 1% of PRs are vioded. We needed a stronger signal. And for that we move to looking at the PR descriptions themselves. Uh often bots will add uh notes in the PR description saying that they contributed to the PR and that's another helpful tell that the bots coded the PR.

Uh this was a much more frequent signal. About 20% of PRs in a database were able to be identified this way. And lastly, if you've used codeex or cursor recently, you'll know that any branches that they create uh will have their names in the prefix of the branch. And so this is a very easy tell as well because humans are unlikely to make branches with these names. Uh and so putting these three signals together, we found that about 27.6% of PRs that were written in April uh had strong evidence of being fully vioded. And that's a very interesting number. But it's even more interesting if you look at the history of this number uh since the beginning of multifile agent systems.

As you can see, it's been going up rapidly and we expect it to continue to go up rapidly. Fully agentic software engineering is the future after all. And so, if this really is our future, it begs the question, are these PRs any good? Uh, you know, do we expect to see a significant degradation in the quality of code uh because of this reliance on agentic systems or are they actually better at writing code than human beings and we're too scared to admit it? So in order to answer this question, we first have to ask ourselves, what does it even mean for a PR to be good? And we tried to quantify this in a few different ways.

The first met metric that we looked at was the revert rate of these PRs. Uh typically when a PR is reverted, it means that it caused breaking changes in production or caused issues downstream. So we looked at the breakdown of revert rates by author and we found that some agents were actually able to uh have their PR reverted at lower rates than the human baseline uh namely cla and We then also broke this down by the size of the PR as measured by the number of files that were changed. Interestingly we found that AI agents on average uh had their PRs reverted at lower rates than humans the larger the PRs got.

Another very interesting signal of the quality of a PR is in the gre comments that it receives. So as Vishan mentioned at Grapile, we review pull requests and in the process of reviewing those pull requests, Grapile leaves comments on your code like a human would. Now Grappile also rates those comments on a scale of P 0 to P2 where P 0 is a critical codereing change and P2 is a nit. Now you can imagine that if a PR receives many P zeros or many critical bugs, then that is a lower quality PR than one that receives only a few knits or no comments at all from reptile. And so uh to look at this as a metric, we broke down the the severity of bugs produced by each bot um and looked at that compared to human baseline.

As you can see, each bot uh or rather the majority of bots produced fewer critical bugs than humans on average. And this is interesting. It means that you know on average if you're hoping to avoid codereing changes things that will take down production uh bots are actually more reliable. That being said if you look at the entire distribution of severities only some bots were able to avoid bugs of all severities compared to the human baseline. So again it's unclear whether on aggregate bots are better or worse than humans at writing The third metric we looked at is how many rounds of review it took for these PRs to get merged.

Here we defined a review round as essentially uh the bot opening a PR, a human leaving feedback in the form of comments on that PR and then the bot going back and making changes to address those concerns. This helps us understand two different things. One is how good the bots are at writing good code uh the first time through and the second is how well they're able to incorporate feedback and make changes without introducing new We once again broke this down by bot author and we found that a few bots were actually able to get their PRs merged on average more quickly than humans. Namely, Devon and Claude we found uh were the best on this metric.

So, so far we've looked at a couple different metrics for whether bots are better or worse at writing code than humans, and we've seen that there's no real conclusion. Some bots are better at writing code than humans based on some metrics, but they lose on other metrics. The winner is kind of unstable. It depends on what you're measuring. Uh, and so perhaps the right question to ask is not are agents in aggregate better at writing code than humans, but perhaps the question to ask is how do bots produce bugs? Do they look different than human beings and in what ways? And so to investigate this question further, we looked at the breakdown of different bugs that each bot makes compared to a human baseline.

So namely uh if you compare the rate of bugs to the humans, red over here means that the the bots make more of that type of bug compared to humans and blue means that they make less and the intensity of the color uh corresponds to the magnitude of that change. Now as you can see uh the kinds of bugs that each bot makes vary widely depending on the bot. So for example, cursor background agents are much more likely to make N plus1 query errors, whereas clawed agents are much more likely to make missing tenant check errors. There is no one clear bot that necessarily wins across every single metric.

And you know the shape of each bot looks different. Now what we've learned here is that bots make different kinds of bugs than humans. Not necessarily better or worse by all the metrics that we looked at before, but different. And so one thing that we haven't talked about yet is that bots just allow you to ship code much faster. So if the quality is roughly the same, albeit differently shaped, and the magnitude is greater, then I guess we can say that AI code agents are actually good. They allow you to write much more code, except that you have to be mindful of the kinds of bugs that they make.

AI agents are writing more code than ever. And as Rohan mentioned, the shapes of bugs that they create is different from humans. It's clear that as AI coding uh scales into the future, your your code validation systems need to adapt and scale as well for the AI agentic future. At Reptile, we're helping thousands of companies manage their everinccreasing scale of AI code using AI code review. We spend a lot of time understanding the strengths and weaknesses of the individual models. So that we can use them in tandem to help catch more bugs and create better quality code for If you're interested in learning more about what Gretell does, check out our website at guptell.com.

Uh, and if you're interested in chatting with us some more about what the future of AI coding and AI code review might look like, uh, please come find us uh, at our booth and we'd be happy to chat some more. Thank you so much. Thank you so much. Next up, we have Yunong Zang, who's a research consultant with Sonar. Uh, Yunong will be talking about AI agents in your code quality pipeline, uh, shipping, securing, and measuring them. Um, Yunong, stage is yours. Right um good morning everyone um

Code Quality Agents in the AI Pipeline: Remediation Loops and Executable Evals

Yuntong Zhang (Sonar; final-year PhD)

Summary

  • Yuntong walks through two production Sonar systems for agent-driven code workflows.
  • SonarQube Remediation Agent (open beta) ingests SonarQube findings on a PR and opens a follow-up PR with hunk-by-hunk patches plus rationale.
  • Security architecture takes sandboxing as table stakes then layers defense-in-depth: constrained workflow (no free terminal, no arbitrary internet); codebase-as-attack-surface (malicious PR contributors injecting prompt-style commands are neutralized by replacing commands with identifiers and swapping back post-run); import guards against typosquat/supply-chain pulls from npm/pip.
  • After patch generation, SonarQube re-runs on the agent's output and forces retry with regression feedback until the quality gate passes — only then surfaced to the dev.
  • CRAP benchmark evaluates AI code reviewers: existing methods (token similarity, line-localization, LLM-as-judge) are noisy, so CRAP converts each historical human review comment into an executable test, then measures whether a coding agent acting on the AI reviewer's comments produces code that passes those tests.
  • Headline: across all current AI review tools combined, they catch only 41.5% of what human reviewers flag.
  • Decomposed by category: AI reviewers strong on robustness, edge cases, testing; weak on maintainability and design — humans win on contextual codebase knowledge.
  • Takeaway: layered review with AI as first pass and humans on the categories AI misses.

Why this matters

Sonar's two systems crystallize the agent-quality stack: a remediation agent with defense-in-depth (constrained workflow, codebase-as-attack-surface treatment, import guards, re-run-the-analyzer quality gate forcing retry until tests pass) and CRAP, a benchmark that converts historical human review comments into executable tests and measures whether agents acting on AI reviewer feedback produce passing code. The headline number AI reviewers catch only 41.5% of what human reviewers flag, strong on robustness and edge cases but weak on maintainability and design is the operator-useful datapoint for sizing how aggressively to lean on AI review, and the executable-eval pattern (turn implicit judgment into deterministic tests) is a generalizable primitive for any domain where reviewer expertise needs to be operationalized.

Slides (8)

Slide notes

  • [t=204:05] Slide 4 'Controls that make this safe on production code' — 3 guardrails for SonarQube Remediation Agent: (1) Constrained workflow — not a free terminal; predefined pipeline, no shell access, narrow tool surface per stage. (2) Codebase is an attack surface — prompt injection via comments; walk the AST and replace every comment with a UUID before retrieval, swap back at the end. (3) Slop-squatting guard — hallucinated imports; reject any patch that introduces an unknown package, kills the supply-chain attack class.
  • [t=207:37] Slide 8 c-CRAB benchmark methodology — each human PR review becomes an executable test. Three inputs: (1) The PR line (e.g. `if isinstance(keyboard[0][0], Sequence) and not isinstance(...)`); (2) Human reviewer comment ('This will raise on malformed inputs like [1], instead of safely returning False.'); (3) Auto-generated test (`def test_check_keyboard(): keyboard=[1]; try: result=check(...) except: ...; assert result is False`).
  • [t=207:37] Grading protocol: Show the PR to the AI reviewer -\> hand its comment to a coding agent -\> run the generated test. Test passes = review caught the bug; test fails = review missed the bug. Converts subjective review quality into binary pass/fail signal.
  • [t=211:08] Slide 11 closing: 'Shape the agent around the human — on both sides.' Ship side = Controls before capability (SonarQube Remediation Agent QR). Review side = Humans + AI, together (c-CRAB benchmark QR). Contact: Yuntong Zhang, linkedin.com/in/ytz.

Full transcript (2071 words)

I'm in uh so I'm a research consultant from Sonar and I'm also a final year Ph student at a US um so today I'll be talking about uh a AI agents in your code quality pipeline um specifically I'll talk about how do we secure And also how do you review the changes made by these coding agents. Um so um opinions here are my own does not reflect any of sonar and i um for the standard disclaimer disclaimer. So um here uh is a very high level diagram. So if we think about how the code are making are being made and being merged into repositories. Uh these are roughly the very high level three steps. So the agents will write code and almost always now agents will review the code because there are just too much of them and then human may decide whether to merge in merge them in or not.

Um so today I'm going to talk about two aspects within this pipeline. First is when agents are writing code how do we build an agent in sonar called sonar remediation agent uh which fixes sonar cube issues and then I will talk about how do we evaluate code reviews generated by agents in a more reliable Um so um so here is the first part it's the uh sonar cube remediation agent. So um basically the workflow is that sonar cube which many of you know is a very widely used study analyzer to scan your code. So sonar cube will find all your issues in your PR and then you can invoke sonar cube remission agent to automatically generate patch for you.

So this agent will open new PR on top of your existing one and then suggest changes to improve it. So the screenshot on the right shows that uh shows what this agent will look like. So you open this PR tells you about which issues in solo cube it has fixed and then um give you these patches hunk by hunk and tells you an explanation of why this patch is fixing this issue. Um so we have released this in open beta and we have re received a lot of feedback from uh So um the one thing I want to talk about more today is how do we secure these agents uh when we go when we put them into production. Um so because these agents work on a lot of enterprise code we want to make sure that there's really no security issues when we deploy and run these agents.

So we have heard about sandboxing agents that is very important uh we use it when we deploy this. Uh but I want to what I want to say is that we also want to build security in depth. That means we build layer security after we deploy sandbox. We also built in security within the agent and after the agent ship the code. So uh here are a few things that we we have done within the agent. So one thing is that we are building a very constrained workflow for this agent because we know that it's going to work on a very concrete scenario which is fixing the son issues. So there's no free terminal meaning that the agent cannot just randomly assess internet and execute arbitrary commands.

And also we consider codebase as an attack surface not just MCPS and skills but the codebase. Uh so imagine if someone uh opensource comm uh contributor is open a PR in your repo and then this this person turns out to be malicious. So they can actually inject malicious commands in the PR that they send to your repository. So um that's one thing we considered. So when we actually run this agent, we'll replace all of these commands uh into some other uh identifier and then swap back these commands after the agent is done. Um and also um we we uh we want to deal with this supply chain attacks. So this is for the scenario where uh if I'm a malicious actor and I'm um stop squatting and running repo in pip and I want to avoid agent from importing those kind of repositories.

So we build a lot of import guards to make sure that the agent does not import these malicious Um so that's what happens inside agent and here is how we verify the agents patch after it's done. So um when agent generates a patch we run solar cube analyzer again on this agent generate patch and then if we find regression or we find any security issues the agents is tked to retry with feedback from the previous iteration and then it's only sent to to developers when the quality gate passes. So that that was the first part uh I want to talk about. Um so now we we want to switch gears to how do we evaluate this code reviews.

So this is becoming a real bottleneck now because agents are making a lot of PRs to your repositories and then it just impossible for human to to review all of these PRs. So a natural way is to use uh AI review tools to help you review the the PR but there are so many of them and how do we know which one are better at your specific use case and how do we reliably evaluate them. So this is a question we want to research on. Um so here are what uh existing methods do. So uh if you consider a scenario where we have some historical PRs, human have made some comments on them and then we run this AI review tools and we want to see whether AI tools are catching the same error as human did.

So of course if they catch more similar errors they're better. So here are a few metrics of what people previously did. One is that we can check text similarity. We can see whether the area review tools in natural language are generating similar tokens in semantics compared to humans. But as you know if we even if we point to the same issue this issue can be word can be worded in very different ways. So this token similarity sometimes doesn't work and also u we can also consider localization. This means that taking a PR we compare each line at the location where these bots and also the humans are making comments and we say that uh the bots are good if they're making the same commands at the same location as humans.

But again this does not tell you the semantics that's only the location. Um also easy way you can take LM as a judge. Uh you can ask the language model whether these two commands the same. Uh so it works sometimes but it's hard to tell whether they are really reliable or not. Uh so uh the the gap here is that we want somewhat determines the way of checking whether a real command is good. Um so um this is what we did. So we we built a new benchmark uh called Crap. So it also takes works in a similar scenario meaning that we want to check whether AI generated reviews are catching similar issues as humans but the core idea is that instead of using language model component we are turning every human review into one executable test.

So here is a concrete example taking this PR line on the left uh if this is what someone changed the the codebase and the human review will say that okay this may raise some more inputs like this. So instead of safely returning false so this is a concrete improvemental code uh that human review has human review has suggested. So uh correspond to this we will generate a test on the right. So this test basically corresponds to this review command meaning that if this review command is addressed this test will pass otherwise this test will fail. So now now for all these PRs we don't have human reviews anymore.

We have all these executable tests. Now the second part is how do we actually evaluate AI reviewers based on this input. So we show the the AI review tools the PR ask you to make comments and then we take another coding agents to improve the code based on these comments. Now we have a different version of code which is improved according to the AI based commands. Then we run this executable test to check whether this updated version code is good and how many of these tests are passed. So in this way we can tell how many of the human commands the air reviews tools has have catched. Um so um here is the results we have gotten.

Um so the concrete numbers for each tool is not that important because these review tools are getting better and these language models are getting better every day and this these numbers are are obtained in the early of 2006. Uh so the part I want to highlight is this number on the right. So if we consider all of these review tools together, they addressed 41.5% of what human review human reviewers have pointed out. So this actually means the these um the current review tools does not capture even half of what the human reviewers has pointed out in in the Uh but this is not the full story. So other than this number, we actually look into all these AI generated commands and see the quality of them because they can also point out to other errors that humans did not identify but they are still valuable.

So uh we further look into all these review comments generated by humans and AIS and put a categorization around them. So this categorization is beyond bug fixes. So we basically put them around security, efficiency, compatibility, robustness and so on all the way to about documentation and the maintainability of the code. So uh and this diagram shows uh what how each review tools is doing compared to human reviews. So we can see that AI is actually doing very well on robustness of and testing. So they will suggest you to test more code. They'll point out edge cases in code and ask you to add them.

So this aligns with my personal experience as well. So AI is very good at pointing out things that I didn't notice before. But then on the other hand, human reviews are very good at maintainability and design compared to AI. So so they they'll talk about how this code is not maintainable anymore if you add so much changes. You should and you should organize code in a different way because human reviewers has more contextual knowledge about the codebase compared to AI reviews. So um the takeaway here is that for now uh we still we still should use AI and human reviews together. Probably AI reviews should use as the first layer and then human reviews can look at these specific categories that AI reviewers were not so good at.

Yeah. So uh yeah that's all I want to talk about. Um so I I talked about how do we build agents where we focus more on control and safety and also how AI and human should work together for code review task for now pro probably in the future we can have AI tools which are trained focus more on these aspects that they are missing right now but for now I think this should be a solution that we build layered reviewed uh on our codebase. So these two are the QR codes to our paper. So we have a research paper on each of these topic. Um feel free to read if you're interested. Uh and happy to chat after.

Thank you so much. Next up, we have Singapore's very own Eugene Chia from Featherless, who will be talking about how open-source models are here now, and it's time for Singapore to build Apologies for the technical difficulty. Um to me AG when AGI is actually truly solved, right? These things will be resolved and so is it printers like it should never happen. All right.

Open Models Are Here — Why Singapore Should Just Build

Eugene Cheah, Founder/CEO Featherless AI (co-lead RWKV, Linux Foundation)

Summary

  • Eugene Cheah opens with a live demo: Qwen 27B and Gemma 3 1B (running locally on a laptop via LM Studio + Cline in VS Code) building a working asteroids-style web game.
  • Featherless AI is a platform offering instant access to 30,000 open-source models today (targeting 2-3M), surfaced via Hugging Face and OpenRouter — the thesis is the platform shouldn't choose models for users.
  • Just-closed Series A at USD 120M valuation led by Airbus Ventures and MD Ventures. Usage data: consumers chase frontier (DeepSeek dominant in December, then Gemma exploding); enterprises optimize for efficiency (Mistral Nemo).
  • Top workloads are agentic coding (Cline/Claude Code analogues), AI companion/therapy/role-play (highest user counts but lower agent multiplier), coding, and ~5% ChatGPT-substitute traffic.
  • The most interesting tail is fine-tunes: Sambar AI (Uganda), Denu (agriculture), OpenHands (Singapore-trained), Cisco's security foundation model — region- and vertical-specific models are where a trillion-dollar inference market materializes.
  • Headline claim: open models have crossed the Sonnet/mini line and are approaching Opus; the two laptop-runnable models already beat GPT-4 on coding.
  • Pattern repeating: today's frontier runs on your laptop next year. Closing line: the bottleneck for Singapore-as-AI-hub is not models, it's builders — just build.

Why this matters

Featherless' demo of Qwen 27B and Gemma 3 1B beating GPT-4 on coding from a laptop, paired with a USD 120M Series A led by Airbus Ventures and MD Ventures, is a concrete data point that Opus-class capability is collapsing onto consumer hardware within 12 months. When inference commoditizes, defensibility shifts from model access to harness, distribution, and workflow capture — and the long tail of region- and vertical-specific fine-tunes (Sambar AI in Uganda, Denu in agriculture, OpenHands in Singapore, Cisco's security model) is where the trillion-dollar inference market actually materializes.

Slides (26)

Slide notes

  • Speaker intro slide: Eugene Cheah @ [Featherless.AI](http://Featherless.AI), presenting 'WTF Do People Use Open Models For?' at AIE Singapore 2026 [t=215:13]
  • AI model creator — notes less than 1 in 100 teams worldwide create AI models from scratch [t=215:13]
  • Founder and CEO of [featherless.ai](http://featherless.ai) — Series A, \$120M valuation round, led by Airbus Ventures and AMD Ventures [t=215:13]
  • Co-leads the RWKV open source project — the first AI model under the Linux Foundation [t=215:13]
  • Born in Singapore, repeat startup founder; over a decade in startups, enterprises, software and open source; flies between east and west regularly [t=215:13]
  • Photo backdrop: prior speaking slot at AI Engineer World's Fair (Microsoft-presented, Sierra/neo4j sponsors) [t=215:13]

Full transcript (1534 words)

All right. All right. Hey, I'm Eugene. I'm going to talk about open source models and why they are here and why Singapore should just build. Um, due to the limited time span, uh, I may slightly lean into English. I might say you go fast and I'm just going to do kick it off with a live demo. And for this live demo, I'm just going to like very quickly do a simple web game. Um, but what is more interesting here is I'm not going to use the best Frontier models. I'm not even going to use the best open source model. I'm going to use the Quen 27B and the Gamma 431B that can run on your laptop. So, I have the prompt here.

I'm just going to quickly get that and get that running and I hope my internet didn't disconnect on me. Okay, so as you can see um I'm using client which is one of the open source coding agents uh uh that is integrated in VS Code. Uh you can use anything uh that's not the point of this uh demo. The point is really just to show that these are models that you can use today to actually just build stuff. So trying to wait for this one to Okay, fine. Plan finish. It plan finish it. I didn't I didn't even check it. Okay, so these are models, right? More importantly that you can run on your laptop. So this is an example of MM Studio with the gamma 31B that uh that is running on my on this laptop itself.

You can run it on a Men laptop. You don't even need the highest end. Uh and this is the the the same coin 27B except that it's probably faster if I run it in the cloud. So uh I'll leave that one running in the cloud. Yeah. Yeah. So a little bit about my background. I'm Eugene. Uh I'm I'm an AI model creator. Uh one of less than a few hundred teams worldwide that have created AI models from scratch. Uh and significantly in Southeast Asia, there's only really a handful of us. Uh founder and CEO of Federalist AI uh recently did our series A at 120 mill valuation led by Airbus Ventures and MD Ventures.

I also co-lead the RWKB open source project, the first AI model under Linux Foundation and I'm born and raised in Singapore and a repeat startup founders. Um, and I work in startup enterprise software uh, banks, open source space for over a decade. I fly at pretty much every month between the east and the west regularly. What is Federalist AI? We are platform that provides instant access to the entire collection of open source model. Today is 30,000 models. In the future we want to support all two million models or even three million. At that point our principle is that we shouldn't be choosing be the judge to decide for you what models you want to use.

You should be able to decide for yourself. And so this is something that we are scaling to provide access uh instant access to everyone and you and you can access us through hugging face and open router. What makes this also interesting is that when you let the users have the choice of the model, the entire collection model, well, it's still in the early stage of 30,000 and we're scaling that up. You get to observe what models that people actually use when they given that choice. So that's pretty much the background for the talk like what the do people use open models for? Because at the end of the day right it's really about like getting those insights that I find more interesting respectively.

So to answer that question I'll split it up into two major segments. The first one is which open model classes are being used. This is typically what people find exciting when they are first entering the open source AI landscape because they like should I use the quen or the deep sync and things like that. But this is probably one of the hardest metric for me to present because every time I do the slides it gets outdated like the next week. This was December when most of our traffic was dominated by deepseek for consumers and for enterprise customers it was dominated by administr Nemoi. I think this is a very interesting pattern because consumers like to instantly test the latest and the greatest and experiment where enterprises like to run things at scale and so they focus on efficiency.

But soon after it got replaced and then like just a few days ago like gamma started exploding off the charts and and and this is literally the updated version chart for that I had to update for the talk itself. Oops. Oh, okay. It it ran finish. Okay. Uh and so this is a shout out to uh Ivan and the Google Tig team. They did an awesome work with the gamma And so what are these models then used for? Um oops live demo issues but never mind. Like increasingly we all heard about open claw agentic use cases that that that represents a huge bar of our traffic. The other major one is AI companion therapy and role play that actually virals the agentic claw usage but the agentic claw usage will be a lot less users running a lot of agents where the AI companion space you'll be some usually commercial clients where they where a company will have thousands of users coding use cases these are based on the metadata that we have like client and clot code and things like that we can see these kind of use cases and subsequently like 5% chbtt Oops.

Once again because we do not perceive any prompt on the uh completion data we infer this number approximately. So what is interesting beyond that right is down here I'm uh down here I'm representing by model classes but when you represent it by fine-tune models be you may have heard of fine tunings to specialize the models for your individual use case or company use case you can see the difference in the chart respectively. What I find most interesting is not the top one/ird or half which is usually all the popular models but the bottom half because if this inference market is going to be a trillion dollar market this is where the things gets interesting and this is where we see AI models being support fine tuned to support specific region like we are proudly one of the providers for the sambar AI one of Uganda's first language model or the Denu AI model which is an agriculture language model we also see use cases for medical for Open hands which is also trained in Singapore and also like for security like Cisco foundation model respectively.

So what what I find exciting about these trends is that more importantly open models are crossing the current sonet and mini line and operating of opus the level of intelligence in laptops accelerating and long context cost is dropping. I'm a bit pressed of time so I'm going to move faster. This is basically open models matching sonet and approaching opus for for the AI model. Yes, still slightly behind, but it's almost there. But this is the more interesting one. The two models that I was running already surpass GPT4 encoding use cases. Sure, they may not be GPT5, but mind you, they run on the laptop.

Basically, the models, the best models you see today will possibly be running on your laptop next year. That's the pattern that has been repeating in the open source space. And that is why I'm going to skip this part. Uh that's why right like I this is an important thing that I want to stress to all the AI engineers here because let's just see the live demo. Okay. Okay. So this is so this is one of the asteroids. Um let me see. This would be the gamma 31B but let's just try uh opening for the quen 27B for example. And you can see this is also another one. The fact that this works on potentially running your laptop, right, is the significant thing because right now today all these models that run your laptop can do UIs, APIs or anything else.

And sure it may require a few retries. But if we want to make Singapore the AI hub of the world or Southeast Asia, the problem is not the models, it's us. We just need to start building. And that's what I want everyone in Singapore to start doing. Just build because there is no barrier. Yeah, that's all. Thank you. Thank you so much. Thank you so much, Eugene. Um, next up we have Max Buckley who's the head of knowledge research from XAI. Max will be talking about well his top talk title is November 24th, 2025, what comes next. Max, over to you. &gt;&gt; Hello everyone.

What Comes Next When Software Scarcity and Coding Costs Collapse

Max Buckley (Head of Knowledge Research, Exa AI)

Summary

  • Buckley argues Claude 4.5 Opus (released Nov 24, 2025) marks a regime change: cost-of-effort assumptions underpinning institutions are collapsing, and you can't opt out of the game-theoretic consequences.
  • Proof-of-work signals have inverted — eloquent emails, typo-free writing, polished school assignments now signal an LLM; a typo signals authenticity.
  • Coding went from tab-complete to autonomous agents running for hours; the models haven't internalised the shift (Claude still quotes 12-week estimates for projects it ships in 30 minutes).
  • Coding agents dissolve the IT-literacy barrier: natural-language interfaces let non-engineers operate computers, and open source effectively becomes open to anyone literate.
  • Economics-of-scarcity stack (ruthless prioritisation, quarterly roadmaps, PM funnels whittling 1000 ideas to 16) was built on software being expensive and slow — that scaffolding now has to go.
  • Moats shift from code (easily reproduced) to brand, GTM channels, and ML/data (non-deterministic boundaries, harder to reverse engineer).
  • Software supply explodes (GitHub commits up 14x YoY at run-rate); marginal cost of a bespoke tool approaches zero, so the bottleneck moves to GTM and code review.
  • Compounding skills: statistics (broadly distributed, not centralised), ideation, taste, knowing what exists rather than syntax depth. Closing line: the question is no longer can you build it, but what should exist.

Why this matters

Buckley's argument that Claude 4.5 Opus marks a regime change in proof-of-work signals and software economics is the cleanest articulation builders will hear of where moats are moving: code itself commoditizes while brand, GTM channels, and ML/data become the durable layers. The operational shift — build 30 ideas and let evals decide instead of whittling 1,000 to 16 in PM funnels — combined with the proof-of-work inversion (polished prose now signals an LLM, friction signals a human) reshapes both how teams ship and how they communicate.

Slides (9)

Slide notes

  • Slide 04/12 — Section 03 THE PHASE TRANSITION: "Code completion -\> coding agent." Four-era timeline: \~2018 Tab-complete (Finish my line); \~2021 Copilot (Finish my function); \~2024 Chat coding (Write me this thing); Nov 24 2025 Agent (Build it. Test it. Fix it.) — the last card is black/highlighted as the regime change. [t=225:07]
  • Caption under the timeline: "Claude tells me a project will take twelve weeks. Claude Code cooks it in 30 minutes." — 12wk -\> 30min compression as the lived proof of the phase transition. [t=225:07]
  • Talk framing reiterated on each slide: M. Buckley, Head of Knowledge Research, Exa AI, AI Engineer Summit Singapore, Nov 24th 2025 — "What Comes Next?" [t=225:07]
  • Slide 06/12 — Section 05 THE OLD AXIOMS: "What used to be true." Three legacy industry assumptions being dismantled. [t=228:40]
  • Axiom I: Software development was expensive — "Engineers are scarce, salaries are high, every feature has an opportunity cost." [t=228:40]
  • Axiom II: Software development was slow — "Complex features take months. Roadmaps are quarterly because the work is." [t=228:40]
  • Axiom III: Therefore: prioritise ruthlessly — "Whittle 30 ideas down to 3. Prioritise ruthlessly. Build only what is strictly necessary." [t=228:40]
  • Closing italics line ties it together: "Every habit, every process, every org chart in our industry assumes these facts are true." — setup for inversion: if cost and speed flip, PM/roadmap/org-chart discipline built on scarcity becomes obsolete; moats migrate to brand, GTM, and proprietary data. [t=228:40]
  • Frame at t=232:14 is a speaker/audience shot with no slide content visible — skipped. [t=232:14]

Full transcript (2376 words)

Uh, I'm Max from EXA. Um, I'm the head of knowledge research and I'm also in charge of the Zurich office which we're currently setting up. This is more of an existential talk, so I won't really be talking about EXA here. Um, and this is not a typo, although it was asked many times, was a typo. November 24th, 2025, what comes next? Um, so what is November 24th, 2025? And that is the day Claude 4.5 Opus was released. And my position here is that that will go down in history as a day when things changed. So my proposal to you here is that the game theory underlying sort of society is changing and Genai is driving this.

Um and I'll give a historic example with ChatgPT a few years ago and I give a more recent example with Opus. But basically the institutions that we have were built on the assumptions that certain things are costly and these costs make certain things work right but when we remove the costs the systems built around them they can fail to work they can crumble. So one historic example of this was proof of work, right? We had a lot of systems that required people to make an effort in order to kind of prove they had made an effort. And by doing so, you know, you would get people to learn in schools.

You would find which com which people really wanted to apply to your company for jobs. You know, you could also, you know, know if someone was credible. Nowadays, if I get a message, like an email or a LinkedIn message and it's really well written, I don't think the person is really eloquent and really made an effort to talk to me. I think the person just used an LLM. Whereas previously the opposite was true. Nowadays, if you get something with a typo, someone either got a model to generate typo text or edited it intentionally to make it more typoed. And the reason I talk about this in game theory lens is that you can't opt out of this.

Even if your university has come up with some claim like we don't allow Gen AI, you know, projects, that just means your students have to edit the delves out of it and remove the emphasis dashes. So you can't opt out of these changes. They're coming for you. Um and a similar shift is happening now in coding, right? So over the last sort of eight years, we've gone from, you know, tab completion where you complete a line to completing a function to being able to ask it to generate a file to now where you have this coding agent where you can just give it this highle prompt and it will run away for you know minutes to hours and build the whole thing, test it and verify it and come back to you when it's done.

And this is quite a shift and something that hasn't fully been kind of played out yet. What's interesting is the models themselves aren't even aware of this shift. So if you use Claude, it will still use the time estimates that used to be true. So if you give Claude a big spec and say, "Here's a crazy idea. Let's implement this research paper." Claude will tell you this project will take 12 weeks. You then copy the markdown into Claude code and it wors away for 30 minutes and then it's done. You know, clearly it hasn't understood how much the world has changed. And I don't think this original estimate was wrong.

Like I've worked at Google with several like, you know, very good engineers where you would assign this to like a an junior engineer and it would indeed take them 12 weeks and that's 12 weeks of check-ins and iteration and making progress. Remember this concept of IT literacy? I mean, probably I'm preaching to the wrong audience here, but it used to be the case that many people were scared of computers or found them difficult or hard. And the reason for this was that computers were hyper literal, right? If you missed a semicolon or had something like a typo, the computer would just say, "That's not found.

That doesn't work. You're out of luck." Whereas it literacy was about helping people, normal people, get used to using a computer, like make them realize that yes, you missed the semicolon, but don't worry, you can just put it in and it'll still work. No, an illegal operation is not actually a crime. Don't worry. Um but again one of the things that coding agents are driving and I think coding agents that term even undersells the potential is a shift here right because coding agents or just having an agent running on your computer makes computers have a kind of natural language interface like normal people now have this one hurdle which is how do I open the terminal?

How do I launch clawed code? And now they can use a computer in a way they never could before. They can talk to it in natural language. It can talk them through how they do whatever they want to do. How do they set up their printer on the network? How do they, you know, take a screenshot? How do they debug if their camera is visible or not, right? And this something they could not do before. And open source is next. And I say next, I mean it's it's already happening. I mean, there are people in this room who have talked about some of these facets, right? But, you know, open source used to mean open to engineers.

Now, it means open to anybody who has a computer and is literate, which is quite a bit more open. Of course this comes with new problems, new challenges. So yeah, what used to be true like these are the assumptions of the world pre end of last year. So it used to be the case that software development was expensive. There were few people who could code. Those people were very skilled. Their time was very valuable. Um so you know we basically every feature had an opportunity cost. There were whole pieces of organizations designed to ensure we work on the right things by for some definition of right.

You know there was endless debates by managers, program managers, product managers, technical program managers, whatever you want to call them about which project should we do, which ones should we depp prioritize, how much should we invest in fixing the bugs versus how much should we invest in adding new features. Similarly, software development was slow. So even a small feature would take you know a few hours maybe days you know a big feature could take weeks could take months a really big rearchitecture of a system could take years for multiple people right and of course one nice thing about this was that like road maps kind of could align with this quite well right because the road map could be quarterly because effectively the work was quarterly I remember working in Google and you know we'd assign maybe someone like four five six bullet points for their quarter that was the four five six things they were going to work on and do that quarter And usually they would do like 70 to 80% of them.

Um, and so because of these two things, you wanted to prioritize ruthlessly. And there were again systems designed to do this, right? You know, we'd have sales teams who were filing hundreds of issues, requests, features, ideas. And then you'd have program managers sifting those hundreds or thousands down to 30. Those 30 would go to engineering managers who would debate them and would draw a line and say, "We'll do the top 16." And those 16 are fanned out to like the engineers on the team. And so as I say the interesting thing here is that all of our like processes and habits and org charts assume this to be true.

So all of this is going to have to change as these things change. And yeah so basically this whole thing is based on this economics of scarcity right that you know every line of code was very valuable so we should you know prioritize things a certain way. Um you know things like software as a service is a funny one. We've all heard about that it's in kind of danger now. And it's interesting because you know with a good set of engineers you could build in theory like a uh like a workday competitor or whatever other software service thing you wanted to but the question is did you want to like were you willing to commit several people for several years and several million dollars to try to build a basic version and then get to the challenge of selling it and convincing people to switch.

Now that's a lot easier to do and this makes people realize that the moat is not the code anymore but now it's going to be your brand your like go to market channels. I do think ML and data hold a lot modes will hold longer because it's much less clear how exactly where the boundaries lie right so it's harder to kind of reverse engineer than something kind of deterministic and this scarcity thinking also is going to have to go right so this idea of whittling 30 ideas down to three by professional judgment and then implementing three you know we don't need to do that anymore we can now build all 30 do good evals do benchmarking see which ones were actually worthwhile and you can revert the rest and we won't be so attached to those that we revert because we didn't spend 3 months building it and our promotion case doesn't rely on it.

So yeah, the supply of software is going to explode. I mean this is not an original thought. Um there's tweets from the COO of GitHub recently saying that as of the current like run rate, GitHub commits are up 14x year. It's like over 2025 which was already up 4x on 2024. So it's 14x at the current rate and it's growing. So it's going to be even more. What's especially interesting here is that the the marginal cost of like a new tool is almost zero. So nowadays if you're given a task like maybe you need to label some data or debug an issue, you can quickly throw together a new custom UI that you use for that task and never again.

And this is crazy because this UI might take 20 minutes for a claw to cook it, but it may make you like 10 times as effective at like labeling data or sifting through images or whatever else, right? Like because as a human you're good with visual data and you're not necessarily so good with text or whatever else. And now we can just build all these niche apps that no quarter ever justified. Um, so the bottleneck is going to shift to go to market and code review because now that you can build everything, so can everyone else. So people are going to be competing even more to get people to use their ideas, to see their ideas, to hear their voice.

And code review has already been talked about, so I won't dwell on that right now. But basically, you know, code review is just again struggling because of the amount of code we're producing. And of course, AI can also help with this. So what I think is valuable now, what I would invest in is statistics. So statistics was always very valuable at big companies like Google. There was always, you know, some team, some people building statistical tooling for evaluating experiments and then many engineers that would just rely on that tooling. They would just opt in. Now it's probably more useful as a more broadly distributed skill because everyone can be evaluating all sorts of things in many different ways.

And evaluating here could be different things. It could be profiling for performance. It could be benchmarks, AB tests, user behavior metrics, these things. Uh ideation and taste is another important thing. So basically the idea of what to build, having ideas is going to become even more important. And then of course iterating on these ideas and jagged. So my final point here is just that the val the specific value of knowledge I think will change. We're going to move from deep technical expertise when you really detailed know the exact syntax of something to knowing what exists, how and when to use it.

Because with these models, if you prompt them kind of generically, they will kind of often give you a sort of generic response. Whereas when you prompt them with a sort of the right words, it kind of unlocks this strange potential. Like my final example is statistics. If you say please, you know, benchmark my change, it will often do n equals one, run it once, run it twice, see which is faster. If you say use statistics, suddenly it starts spouting things like p values and t statistics and all of these other things and large sample sizes and it goes crazy. Yeah. So that's that. Yes. So basically the question is no longer can you build it.

The question is what should exist. Thank you. Thank you so much to Max from XAI. Next up we have Mark Doyle who is a software engineer with Stripe. Uh as you may make your way over to the stage and Mark will be sharing uh a little talk about Minions uh not quite Minions the movie but Minions which is Stripe's oneshot endtoend coding agent platform uh you'll talk about how they're building it why they're building it what the reasons behind it and some thinking behind how they think about coding Hi everyone. Uh

Minions: One-Shot Coding Agents at Stripe

Mark Doyle (Stripe, Coding Agent Platform)

Summary

  • Stripe processes ~2% of world GDP under a hard quality/security bar, yet 91% of Stripe engineers merge AI-written code daily and fully-AI-generated PRs are up 500% YoY.
  • Mark runs Stripe's coding agent platform — 'Minions' are internal one-shot coding agents that go from a Slack prompt to a merged PR with zero human steering. Distinction: harnesses (Claude Code, Codex, Cursor) are co-pilot; Minions are fire-and-forget when the engineer already knows what the diff should look like.
  • Architecture: each Minion gets its own dev box (Stripe's pre-existing remote dev env — needed because the monorepo is ~300M LoC / 90GB); analyzer agent collects Slack/ticket context and points to the right code region; loop runs coding agent -> lints/tests/typecheck -> LLM judge with clean context -> diagnostic agent -> feed back in, until a PR exists.
  • LLM judge only sees original prompt + current diff, so it can't be talked into 'done' by the coding agent's excuses.
  • Current numbers: ~3,000 PRs/week through Minions, 65% one-shot merge rate with no human intervention.
  • Lesson 1: deterministic guardrails beat screaming-caps prompts at scale — if you're begging an agent in CLAUDE.md to run tests before pushing, just enforce it.
  • Lesson 2: developer tooling (Sorbet, linters, type checkers, fast dev boxes) is now table stakes — bad tooling no longer costs engineer-hours, it costs thousands of failed agent cycles.
  • Lesson 3: building agents into Slack made adoption viral — engineers and non-engineers see colleagues @-tagging Minions in public channels and copy the pattern. Stripe is hiring an EM for this team.

Why this matters

Stripe's Minions show what coding-agent infrastructure looks like when treated as a first-class platform investment at a company processing 2% of world GDP: 3,000 PRs/week, 65% one-shot merge rate, and an LLM-judge-with-clean-context pattern that prevents the worker agent from talking itself into 'done.' The real moat isn't the model — it's a decade of pre-existing dev-box infrastructure, Sorbet static analysis, and deterministic guardrails that make agentic development reliable at scale. Two transferable primitives: adversarial verification (judge sees only original prompt + diff, never the coding agent's rationalizations) and Slack-as-agent-surface for viral internal adoption.

Slides (14)

Slide notes

  • Minion loop architecture (on devbox): Prompt + conversational context -\> Plan next step -\> Make changes (Edit, Write, Bash...) -\> Validate (tests, lint, typecheck) -\> 'Did the agent complete the task?' -\> if no, Diagnose what happened (bug icon) and loop back; if yes, Create PR. Single agent with self-diagnostic branch, not multi-agent orchestration. [t=242:17]
  • Trigger surface = Slack: 'devbox' app posts agent-run completion in-channel. Example: agent run on 'malformed-diff-output-line' completed; Status: Process completed; Summary: agent identified and fixed a bug in parse_diff method of Sourcegraph::Client producing malformed unified diffs (using '--- b/' instead of '+++ b/'); generated PR with one-character fix + updated tests + PR title/description. Action buttons: 'Next steps' and 'See changes'. [t=237:46]
  • Async/headless UX pattern: engineer dispatches from Slack, agent runs on remote devbox, returns PR + summary + diff link back to Slack -- no IDE handholding, no live chat loop. Output is a reviewable PR, not a conversation. [t=237:46]
  • Closing thesis slide: 'Now developer tools are essential leverage for agents' -- Stripe's framing inverts the usual ordering. Devboxes, monorepo tooling, lint/typecheck/test harness, PR system aren't dev ergonomics anymore; they are the agent's substrate. Whoever has the best internal dev platform gets the best agent throughput. [t=246:48]
  • Validation gate is load-bearing: tests + lint + typecheck run inside the loop before the completion check, so the diagnose branch fires on real failure signal rather than LLM self-report. This is why the one-shot merge rate holds at scale -- the loop only exits on green CI signal. [t=242:17]
  • Operator takeaway: the moat isn't the model, it's the deterministic validation surface (monorepo + fast tests + devbox) wrapping the agent. For a payments shop, this maps directly to: ledger invariants, idempotency tests, contract tests as the agent's guardrails. [t=242:17/246:48]

Full transcript (3156 words)

Thanks so much for sticking around. I know it's nearly launched so uh hopefully you can keep this uh really interesting. Uh I work at Stripe on our coding agent platform. Uh my name is Mark. Uh so roughly anything to do with uh writing code with agents and uh the whole software engineering life cycle uh with coding agents I'm roughly involved in um just to before we start talking about what we'll talk about today which is oneshot coding agents. So uh going from a prompt straight to a PR and one shot just to frame the problem a little bit at Stripe that uh we process close to 2% of the world's GDP on Stripe.

So even though we're trying to be on like the really bleeding edge and the forefront of AI with using the models, uh we have like really big obligations to our users and our customers and even the broader global economy to you know maintain a quality bar and a security bar. Uh so that's definitely our like number one thing we keep in mind while we're building all this. That said though we have 91% of Stripe engineers are writing code with AI on a daily basis. So 100% of Stripes are using AI in some form during the software authoring life cycle. Um, but every day we have 91% of our engineers merging code with AI.

In the last year, we've seen a 500% increase in the number of fully AI generated pull requests. Um, so today, yeah, we're just going to talk a little bit about like how we're making that happen. Um, and how oneshot agents are, you know, enabling that for us. Um, one shot coding agents are sort well-known term in the industry, I guess, but something we use uh internally a lot is um creating a PR when you go from just straight from a prompt or a slack thread all the way to the poll request uh just without any interaction. So we all we in Stripe also have the harnesses much like I'm sure all of you have like clawed code, codeex, cursor um we use those as well but we see those as kind of like a co-pilot harness.

So that's when the engineer is sitting in, you know, in tandem with the harness working in like an iterative manner. Oneshot coding agents are specifically for when we think uh the engineer knows roughly what the pull request or what they're trying to achieve looks like. We we don't need them to sit in tandem with the harness like for extended periods of time. So we think it's a little bit wasteful for engineers to be juggling like tons of different teamwork sessions connecting to different agents on different boxes when maybe they could have the planning session with the agent up front and then just kick off this oneshot um experience and not do not have any involvement until they get to the code review phase.

Um so yeah, our goal is to just like save our engineers time. You know, we don't want them spending time like spinning up new development environments, creating branches, pull requests when they already know roughly what code they're going to write. We want to offload all that work to the agent, not just the actual writing of the code. Um, so I'm just going to give you an example of me using one of our oneshot agents. So here I'm investigating a problem with one of our MCP tools with Stripe. It's just uh this is like a very simple example just to show how we do it. Um, we have these like agents in Slack where uh we can say hey what's you I'm seeing this issue.

What might be the problem here? Um, so straight away the agent will come back. It'll read our code, read our documentation, and say, "Oh, look, this is seems to be the issue you were looking for." It's just literally, in this case, a threeline uh or three character diff. It's a very like straightforward change. And right now, the developer in me in the scenario knows that um this change is very simple. Like it's it can be implemented by roughly anyone. You probably wouldn't even need to be an engineer to make this change. So we don't want our engineers now to, you know, spend the next 10 minutes, creating branches, spinning up an agent, explaining the problem again to an agent, copy and pasting this context.

We just want them to literally be able to say, "Hey, go fix this issue. Once you've come back with the pull request, I can approve it and uh or get my colleague to approve it and merge it." Um, so Devbox in this situation is just analogist to minion, which is what we call our one shot agents. Um, and the developer then can expect sometime later to see a response like this where the minion comes back and says, "Hey, our process is completed. Um, go check out like what code I've written." So, the developer didn't need to be in the loop for any of this. And that's like a little bit about the like philosophical side I guess of why we want to do this like why we think we're saving engineers time.

And now I can explain like how do we actually achieve this outcome. So u we saw in the this previous message when I instantiated the agent we see this like message the agent says hey one second I'm cooking I'm working on your task how do we go from you know that message to actually getting a pull request that the engineer can review so is really lucky we've been investing in dev boxes which are are remote developer environments so stripe engineers don't write code on their laptops they write it on remote developer environments and we could probably give a whole talk about why we need these stripe has like a super large monor repo one of the biggest git repositories in the world.

It's close to 300 million lines of code. So like if you clone down our repository, it's like 90 gigabytes. Um it takes a long time to generate our code. So we kind of need to have these remote developer environments. So every time you want a fresh branch or something, you can just get it straight away. We have a pool of them. They're ready to go. And we're really lucky. We invested in these for years because it turns out now they're, you know, really good homes for agents. And agents can be really comfy there. They have all their tools. Um these are not lightweight sandboxes like what we see a lot of in the industry today.

These are quite large like developer machines are uh lots of cores 64 to 128 GB of RAM pretty big machines pretty capable of um like for large scale engineering tasks um and every minion gets their own dev box so they have their own home there um where they can you know one from a security standpoint are safely isolated um sandbox etc. And then two it's just like a good environment for them to write code in. So once we've given the minion some compute the dev box so like a computer to run on it needs like a file system it needs a a shell we've given it that with the dev box the first thing we do to try help it operate in this giant codebase is we hand the prompt or the slack thread all the context we can gather so say in uh the example I showed it was a slack thread where another agent had you know uh searched the codebase given some context maybe there could also have been a a ticket mentioned a pull request mentioned some other context from a colleague we gather all that information and we hand it to this analyzer agent you see here that analyzer agent you know gathers all that context and says okay this is where I think we need to point the agent this is the right part of the codebase and that's when we start uh the actual implementation phase so once we've figured out where we're going to write the code or like approximately what the task look like just summing the whole um contents of the slack thread or wherever we started the minion from into some into a prompt uh we can start this minion loop so the minion loop is the process of making sure we always produce a pull request and the agent doesn't stop in the middle.

And this is what the minion loop looks like. So we start at this white arrow at the top where we take that context I just uh explained where you know everything from the slack thread and we give it to this coding agent you see in the white box. It's just a regular coding agent. It takes the you know as you as maybe claw code or codeex you're very used to using does takes the prompt conversational context and tries to you know advance it makes a turn tries to advance towards a goal. Um after it's you know advanced towards that goal we make it run lints we make it run tests and type check and then we stop and we don't go back to the human this is sort of the difference between oneshot agents and uh you know co-pilot agents is here we pass the result to an LLM judge which is this orange box you see at the bottom of the screen and the lm judge literally takes the prompt that the original author gave to the minion and the current git diff or the output that's been produced And we just ask it is this task complete?

So it doesn't get its context doesn't get poisoned with all the like information conversation all the you know excuses that the coding agent might come up with for why it stopped working or why this task is impossible etc. Um it is lally just a you know unbiased judge that says is this task complete or uh has it failed. Um if the task is complete great we can you know create the pull request and go back to the engineer and say hey it's uh it's ready for your review. Um, I'm finished here. If it's not complete, um, we have a diagnostic agent that looks at the, you know, looks at the output of the LLM judge, looks at what happened in the coding agent session and the original prompt and says, "Oh, uh, this isn't finished because some test failed or this isn't finished.

This hasn't finished yet because it's you've actually implemented the wrong thing. You know, you uh created an API endpoint, but you didn't wire up the front end. Obvious things coding agents will miss." Um, and then we take that context from the diagnostic agent and put it back into the loop. So, we run this loop as many times as we need. And we try to keep the input from the diagnostic agent very short. Uh, so it doesn't blow the context window. But we keep running this loop with the diagnostic agent, the LM judge, and the coding agent. Just keep running it until we have something that resembles a pull request.

It's not always going to be the case that the pull request is correct, but at the moment at Stripe, we're merging roughly 65% of minion pull requests on one shot. So 65% of the time a stripe engineer starts one of these it's being merged without any human intervention. So it's getting pretty good. As the models get better um we see this working more and more. Uh usually it's like the engineer will then want to if it's not you know successfully oneshot it the engineer will want to jump in um and make some changes. So to that point we have a web interface for you know you can continue steering the conversation.

You can also uh you see in the top of the screen here like open the box the minion was spawn spawned in on VS code or in a terminal. And so that lets the engineer take over in the case that the minion like failed to one shot. And so that's kind of the story of how we take this like little coding agent, give it a give it a place to live and then produce these like oneshot pull requests. Um we're merging like 3,000 pull requests a week at Stripe with these. Uh it's really like valuable for um you know saving our engineers time of solving the really small problems and even bigger bigger tasks that the engineer already believes that the agent can oneshot or uh can it can the engineer can provide significant context up front that lets the oneshot PR Um, so if you're building systems like this, there's probably a few lessons we can give you to take away.

Uh, we learned that prompts are really good. So in all of our agents here, like the LM judge, the um actual coding agent itself, etc. We have like very detailed prompts as you can imagine. We've thousands of clawed and agents.md files around our codebase. They're very valuable. However, if you're writing uh one of these these loops like a minion loop and you're constantly making prompts that look something like please please run the test before you make a commit, don't push and run like an expensive CI run with you know before you've run the test yourself or um please format your uh commit messages in a certain way.

You're you know writing in screaming case all capitals you're trying to really trying to convince a coding agent to do something. In that case, we uh really think deterministic instructions are just far better for this. So, anything you can make deterministic, please do it. Um it's really it really helps the agent be successful. Uh trying to argue with agents for things is usually not a great it's kind of like a code smell. Um especially if it involves security things. Uh so yeah, deterministic instructions for writing these kind of loops is like absolutely critical and it just lets the the process be so much more reliable.

If you're, you know, building your own workflow, it may be fine to rely on these like screaming case um context files, but for doing this at scale when you have like thousands of developers running thousands of minion runs, uh this has been really useful for us. Uh our second takeaway is that uh developer tools are always super important. So at Stripe, we've always been really lucky that invested pretty heavily in developer tools for a company of our size. Um for example like Stripe open source sorbet which is a a static analysis type checker for Ruby. It's like analogist to typescript for JavaScript.

Um lots of tools like this Stripe was built to boost our development uh velocity over the years. But more so than ever this is like so much higher leverage. So now we see that like these tools are like you must have them. So if you don't have like good uh compute primitives for your agents to run on like for us dev boxes, you don't have static type checking, linting, all these things we expect to have as professional developers. The better your tools are, the more you can do agentic development. So if you don't have these things, it's no longer like, oh, my engineers are losing an hour a week to it.

It's you're losing like thousands of agent cycles that are failing or, you know, are taking much longer than they would have before. So now we're like doubling down even more than we had on building like even better llinters, formatterers, analysis, all these kind of non LLM related things that are mostly static analysis. Um so that's been really valuable for us. Um the last takeaway we have is that um building on Slack has been really valuable for us. So you saw earlier in my uh presentation we have this at devbox or at minion slack message where we can kick off a minion. That's been super valuable for educating like all our engineers about using AI and non-engineers as well can kick them off.

So that process of um sort of building in public and sharing with our engineers that hey it's you know maybe you were about to go down the you haven't been reading on Twitter the latest and greatest in AI um you maybe would have gone down the path of opening up your editor and manually making this change or using um tab completion or something. Now all our engineers see other engineers working in public and just tagging these minions being like, "Hey, go do this thing." That's been really helpful for like helping get our, you know, really large um organization on board with using AI for lots of tasks.

Um that's been yeah, working in public within your companies has been super useful as well. Uh finally, we have a booth over in the rest of the conference. So if you'd like to come chat to me about minions or ask any questions, uh please do. Also, if you think uh working with minions or on this platform is interesting, Stripe is hiring uh jobs. We're actually hiring an EM or engineering manager for my team specifically. So, if this sounds really interesting, uh you should come work with us. Um I'd love to work with you. Uh we also have a giveaway here if anyone's interested. Uh you can come by our booth after and check it out.

But, uh thank you very much. Thank you so much, Mark. Um, our last talk before lunch. I know everyone's very hungry, but lunch won't start till this talk ends. So, I hope you guys give Liha the time uh for him to present. Liho as you said up uh Lihao is a software engineer with a company called similar and he will be talking about from playing solitaire to operating ERP software. Why does your computer need to learn to click and type? So similar is building tools that uh are really good for computer use and Leha will show you how that works. Leha over to you. &gt;&gt; Thank you. Thank you so much.

So, how many hours a day do you think you've moved your cursor around the screen? Five. So, we

Building Autonomous Computers: Reliability, Trust, and Scalability in Computer-Use Agents

Li Hau Tan (Member of Technical Staff, Simular)

Summary

  • Simular is building 'autonomous computers' — agents that see the screen, understand the task, and operate native desktop and legacy software the way a human would.
  • Wedge: APIs and browser agents cover part of the workflow, but enterprise tools (SAP, QuickBooks, internal legacy systems, proprietary desktop apps) have no API and no browser — the only way in is the screen.
  • Agent S3 surpassed human-level performance on OSWorld in December.
  • Reliability splits into grounding (accessibility tree + specialized vision models that locate text and buttons when the tree has gaps) and consistency, reframed from pass@K (succeed once in K tries) to pass^K (succeed every time across K repetitions).
  • Solution is neuro-symbolic: the LLM observes and reasons once, then emits code in a DSL called Simulang; subsequent runs replay the program with zero inference cost.
  • Trust is enforced architecturally — a separate guardrail system, not the planner, decides what's safe ('cannot be judge and player'), citing the Meta AI alignment director incident as the failure mode.
  • Scalability comes from cloud-spawned VMs: sign up and you get a machine, spin up 100 to parallelize regression tests, CRM updates, reporting.
  • Simulang is being released to developers.

Why this matters

Simular's pass@K vs pass^K reframing — succeed every time across K repetitions, not just once in K tries — is the cleanest wedge framing for enterprise agent reliability: enterprises buy determinism, not capability. Their neuro-symbolic record-replay (LLM reasons once, emits a program in their Simulang DSL, subsequent runs replay with zero inference cost) is the clearest example yet of a harness layer that compounds margin as usage scales, while the separate guardrail architecture (planner cannot be its own judge) is a defensible pattern for any agent that touches consequential systems. This is the path into legacy enterprise surfaces — SAP, QuickBooks, internal tools, regulator portals — that will never get APIs.

Slides (17)

Slide notes

  • [t=255:18] "When the tree has gaps, vision fills them in." Neuro-symbolic perception: Accessibility tree (symbolic — window/button/heading/textfield nodes with roles, labels, values like Customer='Sai & Co.', Amount='\$1,240.00') PLUS Visual Grounding (two visual experts: OCR & text, Buttons & icons). Tree gives structure; vision fills the holes the a11y tree misses.
  • [t=258:42] "If you can spin one up — you can spin a hundred." Thesis: "One agent is a demo. A hundred is your team." Reliability framing shifts from single-shot success to fleet-scale orchestration; record-replay + DSL make agents reproducible enough to parallelize.
  • [t=251:54] Live demo: "Here's Sai, playing Solitaire." Sai agent runs in a macOS window controlling Windows 11 Spider Solitaire (One Suit, Score 0, Time 02:40). Left pane is the agent's chain-of-thought game session: analyzes board state, enumerates columns (J,2,3,5,4,9,10,6,7,5,4,10,3), proposes legal moves (2→3, 3→4, 9→10), narrates plan ("move the Jack onto the Queen"), then logs actions ("Dragging Jack onto Queen with corrected coordinates," "Choosing best aim attempt"). Demonstrates closed-loop perception → reasoning → click/drag execution on an unmodified GUI.

Full transcript (1677 words)

If a few months ago, we did an experiment with a group of friends. Some of them are like you, right? AI engineers, builders, and we also have doctors, admins, accountants. And track them, see how how much time they spend on moving the cursor, right? And this is what we found out. Five hours a day. We have someone who move them their fingers on a trackpad for more than like 5 hours a day. That's more than onethird of your time awake, right? Not not creating, not thinking, but just moving, clicking around. Sorry. Clicking around, navigating, right? Scrolling through tabs, uh, through So we've we've put a lot of our work into this digital space but the way we interact with it is still incredibly manual.

The PC. We have PC in 1981, right? Suddenly we're able to do things that we used to take hours within minutes, right? It's a big leap and we freed ourselves from filing cabinets and paperworks. But now look at us 40 years later. We're still clicking, scrolling, navigating around, right? Five hours a day. We traded off like one kind of manual labor into another kind. So we need the next leap. So what would truly an efficient way of interacting with computers, right? What if you don't have to interact with the computers at all? What if the computers can operate on its own? It can see the screen, understand the task and just do it.

And that's what we are building here in similar. We call it an autonomous computer. Right? So this is what keeps me exciting and this is what we're building. So my name is Liha. I'm a techn a member of technical staff at similar uh we're building the infrastructure for autonomous computers. Right. So last December our research agent agent S3 has achieved a surpass human level performance in um in OS world which is the standard test for computer use. So what does an autonomous computers look like right? So let me show you this is s this is our um product. So on the left you can see that the screen we have this chat interface where the LM is trying to understand the task.

It's trying to play a solitire. So it's trying to look at the screen understand what's going on and try to see like what's the next best move and trying to figure out how to move the mouse and drag the cards. Right. It's on the right is the machine that's running on and so hopefully in a minute or so if he gives them encouragement hopefully Sai will play the first move. Yeah. So you can see that it actually able to control the mouse cursor and drag the card right from the left to right. But this is just one app, one task, very clear rules, right? But imagine your actual workday. At work, you what do you do, right?

You have emails, you have Slack, uh you have sorry, hold on. So, at work, you have email, you have Slack, you have spreadsheets, you have your PowerPoints, you have your QuickBooks, SAPs, and some of the legacy systems that your company refuses to retire, right? So some of these tools have APIs, right? So this is where we had a lot of activities is going on last year where we have API or CLI agents. We have um tool calling, function calling. Great. This part has been solved and some of these apps is in the browser, right? So you you heard talks about browser use agents which can handle um navigating and looking at your browsers for you.

Great. But again there's everything else that your desktop apps your legacy systems your proprietary tools there's no API there's no browsers so there's no the only way in is actually through the screen right so that's computer use so teaching an API to see the screen understand what's on it and operate it just like you would with that the autonomous computers is complete we have API and CLI agents we have browser agents and computer user agents working together simless to the users and handle any task on the browser. So this is what we are building at similar and in building it I would like to share three main challenges that we have faced reliability trust and So let's start with reliability to a user.

Reliability means one thing, right? It works every time. Two things have to be true for that. The agents needs to see the screen and act on it precisely. That's grounding. And it has to be able to do it across multiple turns. So across 100 repetitions, that's consistency. So you have grounding and consistency. Let's start with grounding, right? How does a blind person sees a screen? You use screen reader, right? So, it reads accessibility tree structured map of every element, the name, the type, the state, and that's our starting point, but it's not enough. A lot of times there's apps that don't have the complete tree.

Sometimes buttons hidden inside um menus, uh drop downs, you have elements that show up dynamically, and some apps barely even have a tree at all. So we have to back it up with vision grounding. So we specialize models that can actually look at the screen visually and figure out where are the elements, right? One can read the text, one can find buttons and elements. When the tree has gaps, the vision try to fill it in. So we have accessibility tree combined with visual grounding. Now grounding gets you precision on single action. But what about thousand times in a row? In research, there's a matrix called pass at K, which means given the agent K attempts, how many times do you uh how how many times do you get to get it right, right?

As long as you get it right at least once. So it's if K is five and if it succeeds on the third try, that's a pass. But that's not what user wants. User wants what I call pass to the power of K, right? You have to get it right every time. K times in a row. No mistakes. Say a user has a hundred uh leads and wants to send each one a personalized LinkedIn message. If an LLM is driving all the action every step, each attempt might go slightly differently uh on the same task 100 times 100 different behaviors. So we need a different paradigm. So what we use is neuro symbolic approach. So neuro the LLM observes the screen reasons about what to do and then so this is the thinking and symbolic instead of just clicking it writes a program code that's the executing right it's like a brain writes a recipe the machines follows it so here is where it gets powerful the first time it calls LLM inference but doing it again for the second time the 100 times it just replay the code you don't need LLM inference no cost and essentially free.

So the language behind all this is Simulang, our domain specific language for computer use, readable, modifiable and releasing it to the developers this So that's reliability. Let's talk about An AI agent that can do anything on your computer, right? You can uh send emails, delete files, make purchase. That's really powerful, but it's also very dangerous. If it un misunderstood one instruction or it hallucinates, it can become disastrous. So this already happened not to a random user to Meta's director of AI alignment. At similar trust is built into our architecture. The guardrail is a separate system from the planning agent.

The one deciding what to do is not the same deciding whether it's safe. So you cannot be the same. You cannot allow the model to be the judge and player. So even if the planning model gets confused or hallucinates, the guardrail system catches it before dangerous happens. And the third challenge is scalability. Our mission at similar is to scale users productivity by 100x. And how do you get 100x? By having 100 fingers and type 100x faster, you need 100 autonomous computers. But not everyone wants to set up 100 Mac minis, buy it up and set up for them, right? But everyone wants the productivity gain from having multiple autonomous computers.

So at similar, we're building infrastructure to solve this. When you sign up site, this is our product. Um you get 100 uh we we spin out a machine for you right and this is your machine you can do anything you want you can install your app you can set up the way you like it and then you let sigh to take the wheel so um you if you can spin up one you can spin up five you can even spin up 100 right you can have one doing a regression test you can have another one uh do another like something else and have the third one uh do a CRM update after the call and you can have last one uh running reports.

You can have all of them running in why am I in a loop? Yeah. So 100x productivity not by working harder but by having 100 computers work for you. Reliability, trust and scalability. The three challenges and that's what we are doing right there's an incredible engineering behind all this um there's like a distributed systems agent reliability at scale so we are hiring if you're interested do join us so please the PC freed us from paper and autonomous computers free us from human and this and we are similar I'll be around in the booth at level four across the street and we'll see you around. Thank you.

&gt;&gt; Thank you, Lihao. With that, we have come to the thing you guys have been most looking forward to, which is lunch break.

Designing Cursor 3, Baby Cursor, and software that feels like play

Ryo Lu (Cursor, Design Lead)

Summary

  • Ryo Lu, design lead at Cursor, frames the last decade of software-making as a forking error: design, engineering, and PM split into specialized roles connected by Figma-to-Jira-to-PR handoffs, trading craft for coordination and slowing iteration from days to weeks.
  • Agents writing the code collapse that fork — the code itself becomes the design material again, and design becomes 'knowing what to build and how it should feel.' Cursor 3 was built as an agent-native rewrite of the VS Code-derived UI, with a philosophy he calls 'glass not black box': every agent action, plan, and artifact is legible, inspectable, and editable, with progressive disclosure for new coders and full power for pros.
  • The interface morphs per user — sidekick mode with editors closed, multi-agent multi-project mode for power users — and customization happens through plugins and skills rather than forks.
  • The process story is the real lesson: a three-person team (Ryo + two engineers) prototyped the new UI in a forkable Electron app called 'Baby Cursor' built on the Cursor CLI, with feature flags and file-based settings so every permutation was explorable in a week, then reverse-engineered the prototype into a spec and rewrote the entire Cursor UI in React with a fresh design system in two months.
  • Post-ship they built 'Baby Glass,' a web-based successor prototyping environment with real Cursor 3 components, versioning, and prototype-to-PR handoff.
  • Closing thesis: future software-making is collaborative (humans + agent teams in shared context), personal (interfaces adapt to user, not vice versa), autonomous (verification loops let agents own repetitive workflows), and ambitious (use saved cycles to find simpler abstractions, not ship more slop).

Why this matters

Cursor's 'glass not black box' philosophy and the Baby Cursor methodology (forkable Electron prototype on the CLI, feature-flagged permutations, prototype-as-spec, then a two-month React rewrite) is the clearest articulation of where agentic-coding IDEs are heading: legibility, customization via plugins and skills, and interfaces that morph per user instead of forcing the user to adapt. The defensibility argument explicitly names the legibility surface, the prototyping flywheel, and the plugin/skill extensibility layer as the moats — not the underlying model. As agents run autonomously for hours, glass matters more, not less, because humans need to monitor, interject, and shape.

Slides (44)

Slide notes

  • [t=338:00] Slide: "When we're all software builders" — list of roles becoming software builders: Design Engineers, Visual Designers, Product Designers, UX Designers, UX Researchers (faded: Software Engineers above, Product Managers / Researchers below). Framing: the IDE has to serve everyone who now ships code.
  • [t=342:32] "Glass" principles slide over Mt. Fuji backdrop: Focused, legible, customizable / Ideas you can shape / No hidden magic, infinite control / Starts simple, infinite depth. Glass = the design philosophy (transparency, not black box).
  • [t=347:04] Baby Cursor / agent dashboard demo: left sidebar lists agent workspaces (New Agent, Project Setup, Make Plans, Random Characters, Explore Codebase) and project folders "baby-cursor-app" (Mission Control Interface, Fix Navigation Bug), "ryos" (Nightly Test Bot, New Browser Feature), "voxellize" (Implement Undo Fu..., Dependency Updater). Right pane: "Nightly Test Bot" running "Run the nightly test suite" — Running tests... Ran Shell — "All 142 tests passed. No failures detected." Adjacent window: retro Mac OS desktop ([ryo.lu](http://ryo.lu) OS) with Finder, MacIntosh HD, iPod, Applet Store. Caption: "Clarity in deeper architecture decisions."
  • [t=352:43] "Baby Glass" app — plugins/skills extensibility surface. Menubar: Baby Glass / File / Demo / View / Features / Mockup. Sidebar projects include Project Glass, Sidebar reorderable, Composer ghost, Install plugins preview, Agentic chat, Command palette, Toast notification; Cursor section with Chat model auto-routing, Cloud settings catalog, Workspace sync conflicts, MCP Server Manager, Tab Autocomplete Rewrite. Detail pane shows a "Design Lint" plugin (Figma-verified, 18.2K installs, Verified by Cursor, used by 28 teammates) — analyzes SwiftUI/UIKit components against design-system rules (spacing, color tokens, typography, a11y). Skills: Component Audit, Design Diff, Token Validation. Commands: lint-component, lint-screen. Demonstrates plugins + skills + commands as the extensibility primitives in Cursor 3 / Baby Cursor.

Full transcript (3090 words)

I'm Rio. Um, let me wake up my computer first. Nice. Cool. Good afternoon everyone. I'm Rio. I lead design at Cursor. Um, today I'll share how we're designing Cursor to bring designers, engineers back to our roots when making software felt more like play rather than being stuck in rigid roles, tools, or processes. Also share how our design process became more fluid as we designed cursor with cursor. I'll end with our vision for the future of making In the beginning, software design and engineering were the same thing. There were no splits. The people who imagined software also built it. Design and code were the same craft.

The material was the code itself. Thinking and making happened in the same loop. This is Bill Atinson. He was on the early Macintosh team. He built QuickDraw, a 2D graphics engine. He also designed and coded Mac Paint and Hyperart. He invented the marching end selection pattern and a lot of things that we still see in most graphic design apps. He built pixel perfect UI in 68K assembly. Every detail from concepts to design to implementation was his. Was he a designer or a developer? This is Alan K. He invented small talk and the dynab vision at Xerox Park. Basically designed the whole future of personal computing.

He wrote the code that made it real. There's a famous quote from him. The best way to predict the future is to invent it. He built working systems to prove his ideas. From UI to interaction models to the runtime, they were all one craft. Were they designers, developers, they were all builders. The question did not make sense back then. Design was code, code was design, and the craft was Then something really weird happened, especially in the last decade. We've forked ourselves. We split into specialized roles. The designers owns the vision makes the mocks. The engineers implements the mocks. The PMs write the specs, run meetings, keep everything moving.

The promise was specialization will make us faster. But the reality is we got slower and more distance from the code and our tools for too. The engineers mostly stayed in the terminal and the uh Vim, VS Code, Sublime, but this the code is still the source of truth. The designers kind of moved to the cloud. We started in Photoshop making bit maps. Then we moved to Sketch, which is a Mac only app that does vector mapping for like UI. And then we moved that to the browser and made it collaborative in Figma. The designers made beautiful mocks but they weren't real. And then the PMs and collaboration also kind of scattered.

You have Jira tickets that nobody wants to update. You have Google Docs for specs. And then we made notion for weeks and planning, Slack for everything else. And then there is this sassification of everything and per purposefully built tools that actually created more divides and the gap widened. Linear handoffs became the norm. Designer makes some designs in Figma. The PM writes up spec. The engineer gets a ticket. Matching Figma became the goal. But this back and forth in comments and meetings is really And we lost this tight iteration loop. It takes weeks between ideas to mocks to specs to tickets to code to review to staging to prod takes weeks from thoughts to reality.

Designers can't touch the real thing which is the code. Engineers can't explore without a ticket. And the material, the code became someone else's job. We told ourselves this was progress, specialization, best practices, design systems. But we traded craft for process. We traded building for coordinating. We split what was supposed to be whole. Code is a universal language between humans and machines and it is our material of trade. The code is the material again. The code is the source of truth. It is the real thing. It is not a mock. But now with agents writing the code, you can design by asking, directing, refining.

And the craft becomes knowing what to build and how it should feel using the real material and making it real with other humans and Cursor could bring tools and builders together into this one thing again so that we can all make great software together. How do we get there? Enter cursor 3. Cursor started by inheriting a lot of complexity from VS Code. As the agents became the primary way people write code using cursor, all changed within the last year. This legacy kind of became a liability for the agent pill coders. Much of this file ccentric view of things don't make sense anymore. And for the new coders, they still feel a lot of friction getting started, bombarded with all these scary UI and concepts that they don't know.

We also see a shift from operating on this local file state to interact with the agents to moving towards multiple agents running on different projects increasingly in the cloud. And this flips a filecentric view of the IDE to a new hierarchy centered around agents and their environments. In order for us to retrofit VS Code, changing layouts creates a lot of UI forks, edge cases, and broken states. And it just couldn't keep up with how fast the world is changing. So, how do we move from this filecentric view of software to an agent native interface that adapts to each human and what they There are, I think, two main philosophies to building AI tools and the difference really matters.

On one hand, you get a black box. You type in what you want. The AI does something where you can't really see. When it works, you didn't really learn anything. You just skip the thinking. When it fails, you don't really know why. Especially as a new coder, you keep burning more tokens without understanding what's h actually happening. You can't see, can't intervene, can't edit. You either approve every single change or give up. You are just a product of the model. On the other hand, you get glass. It starts simple, but you can see more if you want to. Agents streaming in, code running in the background, AI thinks with you your way.

You can redirect myth lights, stop anytime, stare your way, edit that two pixel padding if you want. You don't have to read every change, but you always can. The experienced coders can let agent flow review at the right time and make edits when needed. The new coders can learn new software concepts with cursor. They can learn by just asking, building, tinkering, and then seeing a deeper slice of the system. You stay in control, build intuition and shape cursor into how you think. As AI gets more powerful, Glass matters more, not less. Autonomous agents running for hours need legibility for humans to monitor and interject.

Multi-agent systems need inspectable, durable plans with clear boundaries defined by humans. We also need a share space and malible interfaces for humans and agents to think together. And we chose the glass way bringing focused, legible, customizable interfaces to humans and agents. Every agent, their actions, artifacts are visible and editable. Plans you can shape, agent states you can inspect. There is zero hidden magic and infinite control. But it starts simple. You can use cursor now with editors closed, no auto opening files, no distractions. It kind of works as a sidekick next to the other tools and workflows that you do.

But it reveals complexity as you use it and you can see more when you want to. As you use cursor for more projects in different stages of making software from planning to designing to execution to review the interface morphs to fit you and let you focus on what you're good at. The experienced developers can roll fast with multiple agents, review changes, and make precise edit edits when needed. The designers can sketch things out, see the code running in the browser, annotate and tweak every detail with immediate feedback. And the product people can think, plan, explore options and trade-offs with the agents that knows about their whole team context in a fully interactive collaborative Everything feels instantly familiar yet powerful.

Designed for the humans, not a model. We let you tinker and fit cursor for you. The core stays simple, but you can customize through extensible concepts like plugins and skills. And there are virtual interfaces that adapt to what you do. We also respect user habits and control. We never force drastic changes. We are not taking anything away, but we can show people there's a simpler new way to do things for those who prefer. Now, let me share how we got here. What's crazy about all this is designing our new interface all happened within about a month. And it all started from a random prototype that we started exploring at the beginning of this year.

So Lee, Rob, and I kicked off Baby Cursor 3 earlier this year. Baby Cursor is our name for our prototyping environment where people can fork, explore ideas, and share with others. When designing AI tools, you always end up with a lot of non-determinist cases where static mock-ups can't capture the nuance. We really had to feel it. So the goal was to design cursor so that it can scale from the most simple form into something sophisticated that the pro engineers and software makers will love. In the new version of this prototype, we made it so that it is a fully functional electron app built on top of the cursor CLI.

I designed a simple layout architecture that could support one to multiple agents, one to multiple projects, zero to end tabs of content and splits. And it works with any space constraints. Things always start simple, but it grows with you as you use the tool more. Making these dynamic states in Figma in mockups will probably take months and it won't give you the same feeling as playing it in real in code. So much of this highlevel like information architecture and flows were pretty much done in a week. In traditional design tools, it is really easy to duplicate artboards and states and export options.

You always end up with a lot of snapshots of states rather than like one cohesive thing to see. In cursor uh in baby cursor 3, we added a built-in feature flagging system and we have settings stored as files. So this kind of allows us to explore both really large architectural forks and explore every single little detail and permutations. Then you can see how things fit together. By playing with the prototype daily and exploring options, we were able to re reveal new constraints that affect the deeper architectural decisions. One example is how does the layout change as you navigate across agents?

Do the tabs on the right change as you navigate between them? Are they pegged to different agents? Are they per workspace or environment? Or are they all independent like VS Code? It is a really hard concept to explain in words, but it is very easy when you can feel it live. We then shipped baby cursor to everyone at the company to play and get feedback. The engineers started forking and adding their ideas and takes into the prototype. I then synthesized them back and did more iterations based on the feedback and new ideas from the team. We learned a lot by building the prototype. Which layouts made sense under real use in different conditions?

What are the different defaults and customization options to expose? How do we make complexity feel simple? How much control should remain visible versus hidden? How much progressive disclosure should work etc. And from the prototype, I reverse the code into a highle spec where we document every single option and details. The videos and the screenshots became the mocks for the new cursor. Then after 33 long discussion threads on the RFC, it's it's time to make it real. And engineering also took a more drastic approach inspired by the speed of building this prototype. We basically decided to rewrite cursor the whole UI from scratch with a brand new design system component library and a clean As the engineers are working on this, I prototype more sidebar grouping customization, input customization, peaking and details.

Then I went back to Figma for the first time so that I can play with liquid glass that we didn't end up shipping and all the visual Our engineering team cooked really hard on this for two months, rewriting the entire cursor UI from scratch in React and building a new design system. Once the things are a little bit more cooked, we started using the new cursor to build itself and we dog food that still feels a little bit weird. The designers also went back to the code. So, we're building out little details, polishing new components, icons, colors, theming, vibrancy, animations, all the little details that the models don't And the design process became really fluid.

It is no longer linear. We just use the best tool to refine the craft. Whether it is about spending more time to think making these prototypes or mocks or just go straight into the code. And in late March this year, we shipped the alpha and we created this rapid feedback loop with both internal and real world users. And we focus on performance quality for our first ship. After we shipped cursor 3, we built Baby Glass, which is our next generation prototyping environment where we can visualize cursor from the very present to the future in one single prototype. It is rebuilt on top of our new design system and it uses the real components from Cursor 3.

We brought it back to the web. So it is no longer electron app because it is so easy to share states and links for others. So they can click on the link and then give feedback. We also improve the future flagging and versioning system so we can visualize cursor from the current production state to each step that we need towards like a more future milestone. We also built better handoff flows so that these baby glass prototypes can turn into the first PR that an engineer can build on top of in the real And it looks crazy good. It has a desktop. It has some wallpapers, themes, and we even built like a tool inside Baby Glass where you can generate mockups and videos.

And we plan to use this for like actual demos in our So making glass give us a lot of clarity into what we think the future of making software could become. It should be more collaborative so that the humans can work together on the same context and tools with teams of agents. As we use agents to accomplish larger goals, it becomes increasingly more important for the agents and the humans to share the same space so they can arrive at the right thing to build. And as everyone is becoming a a builder, people from different disciplines, not just engineers, can finally come together and work on the same goals with the same agent setups, tools, knowledge, and artifacts.

We think the future should be more Our interfaces and tools should adapt to who we are and what we do, not the other way around. Everyone and every team is different while the underlying concepts and tools are the same so that you can build connect your workflows and tools and customize your agents to the most granular level for yourself and your We think the future is more autonomous. More agents can tackle repeated workflows, streamlining and eliminating manual processes while the humans define the system and boundaries. We can automate things like buck triaging, release notes, security and code reviews.

And you can design your system with verification loops and really define what is right so that the agent can handle more things for you. And lastly, we should be building more ambitious things and think about what what else we could do that is not making more things and adding more slop. We can build better, simpler software together. Instead of adding more things, you can actually use the time that you saved to think deeply and figure out what is the simplest abstractions, what is the right thing to build for your users. Do something crazy that wasn't possible together with other humans. The future belongs to people who build to think.

Stop waiting. Start building. No black boxes. Stay glass. Thank you. &gt;&gt; All right. &gt;&gt; Hello. Okay. Uh thank you so much, Rio. And uh just so so everybody knows if they haven't checked if you guys haven't checked it out yet, uh Cursor has a booth in the Italier in Kinsky. So you can go ahead and meet some of the team there afterwards. All right. Um I Thank &gt;&gt; Thank you. All right. Uh I would like to welcome to the stage the next speaker. Uh this is Ain. He is a staff product designer at Welcome to the stage. So Ain uh currently works on Figma weave and has been behind a lot of the very pro popular products on Figma including Figma Buzz and Fig Jam.

Fig Jam of which is something that I personally love to use. Um and he's going to be giving a talk on designing multimodal multiplayer While he's getting set up, uh just a quick few announcements. Uh number one is tonight we are actually going to turn this entire theater around from talks into a nightclub for the afterparty. Yeah. Um so if you uh have just remember to bring your badge. That is actually how we're going to uh check you in. So if you are an attendee uh you could just bring that. Um no worries about having the QR code from Luma. Um and then uh next thing is we have a demo stage in Pullman which is kicking off now actually.

Um but if you want to stop by at some point to see some demos from some of the local startups on how they're embedding AI into their workflows or their products, uh check it out there. Right, without further ado, Aen.

Multimodal, Multiplayer AI Canvases for Divergent Co-Creation

Aosheng Ran, Product Designer at Figma

Summary

  • Ran argues that today's AI tools are optimized to make individuals 10x faster but the real bottleneck once execution is cheap is collaboration and alignment — deciding what to build with a group.
  • Most agentic tools follow the same chat-left / artifact-right pattern: single-thread, one-shot, designed for convergence rather than divergence.
  • She frames the creative process as both solitary and social and pushes for interfaces that facilitate friction between minds.
  • Demo: a multiplayer canvas (built with Claude) where attendees join a shared URL and co-edit small playable artifacts (Flappy Bird variants) via prompt-driven edits, with version history visualized spatially.
  • She walks through edge cases — concurrent edits, conflicting writes, and the need for agents to show 'embodied presence' (cursors, scope of changes, conflict resolution) the way Google Docs / Figma do for humans.
  • Her next frontier is multimodal interaction: natively multimodal interaction models (citing Thinking Machines' work), richer agent body language, and node-based GUIs like Figma Weave that fuse models with deterministic GUI patterns for precision.
  • Pitch closes on Config (Figma's conference) and an invitation to keep exploring divergent, multiplayer AI surfaces.

Why this matters

Ran's argument that the next bottleneck is collaboration and alignment, not individual speed, is a structural critique of every chat-left/artifact-right agentic tool shipping today: those interfaces are designed for convergence, not divergence, and prime users to one-shot rather than branch and compare. The case for multiplayer canvases with agent embodied presence — cursors, scope-of-change visualization, conflict resolution — is the natural evolution from single-player chat to shared agent surfaces, and the live demo crashing under load shows the infrastructure layer is still fragile. Combined with Thinking Machines' 'interaction models' framing (natively multimodal, sub-turn, always-on), this points at canvas-native multiplayer as the next surface where defensibility is built.

Slides (34)

Slide notes

  • [t=359:07] Live demo on tldraw-style canvas at URL `aoshe.ng/aie` showing multiplayer cursors with labels (McSalty, aosheng, cm, Ani, trungid, ryan) collaborating on a shared infinite canvas
  • [t=359:07] Embedded frame titled `AiE` shows pill-shaped `aoshe.ng/aie` slide (page 7/15); a second frame below titled `Floppy Bird` shows the classic Flappy Bird game with score `0`, green pipes, clouds; canvas is `Made with TLDRAW`
  • [t=362:41] Switched to `localhost:5173/aie` — Flappy Bird running with `Game over` overlay and `space / click to restart` prompt, demonstrating a live coded build
  • [t=362:41] Right pane shows agent editing `plan.md` (sibling file `game.tsx`) with `Agent is editing...` status; visible plan text: `Flappy Bird — Tap-to-flap side-scroller — keep a bird airborne and threading between procedurally-spaced pipes`; parameters listed: gravity `0.45`, velocity `-7.2`, speed `2.2`, gap `130`, width `56`, interval `95`, radius `13`
  • [t=362:41] `## Visual style` section reads: `A tasteful pass over the classic flat look — a few flat tones with simple shading and exactly one parallax layer, not a particle sim. The sky is a vertical gradient (darker the top, lighter near the horizon). Two or three soft white clouds and a single rolling green hill silhouette drift behind the pipe field at a fraction of the pipe speed...` — illustrates AI agent generating game spec + code from the canvas
  • [t=366:16] Slide title: `Build new interaction modalities` over a Figma Weave demo: black mecha/robot figure on transparent checkerboard background with white paint-like mask strokes on head and arm; a labeled `Mask` control on the right with a green bezier curve — shows masking/compositing as a new canvas-native AI interaction
  • [t=366:16] Footer credit `Figma Weave` — positions Weave as Figma's multimodal multiplayer AI canvas product

Full transcript (2072 words)

All right. Hello guys. Wow, this is like a lot more people than I thought. Um, okay. Um, I'm Ashang. I'm a product designer at Figma. And today I want to talk about why our AI tools should be multiplayer and multimodal. So the AI tools that we have today focus on really um, making individuals go 10x faster. But I feel like the harder but also more interesting question here is can we make a group of people go 10x faster because when execution gets cheap um collaboration and alignment becomes the bottleneck. This is actually from another AI engineer talk um by Maggie Appleton the research engineer at GitHub and I really I wholeheartedly agree with her framing here because deciding on what to build and what not to build is more important than ever right now and a team's progress will be stalled if the way that we explore plan align doesn't change.

So very relevant to also what Roy just shared earlier. And I just feel like the tools that we have today don't really make any of that easier Most of the agentic tools today have a chat on the left, artifact on the right. But the chat is single access, one thread, one direction, one source of truth. And it primes you to one shot, right? Because there's so little affordance in the interface that tells you how to branch out, compare ideas side by side, etc. And this is ultimately an interface for convergence, not divergence. I think when we design AI tools, we should build interfaces for divergence too because the creative process is both solitary and social and the best ideas gets sharpened through friction between minds.

So I think we should build tools that facilitate instead of removing that process and isolating us from each other. So yeah, here are some ideas on divergent interfaces for AI tools. A few years ago, I helped build this widget called Jambot, which um it lives in Jam and it allows you to explore ideas with Hatcht in a way that's um visual, nonlinear, and multiplayer. And this was back when LM was like all about text before you could like ship code from 0ero to one. And looking back, I see canvas as this really fascinating malleable medium where the additional dimension could make multiplayer presence and branching iterations feel much more natural.

So the next part of this this talk is going to be a little demo and I'd love for you to join me. If you have a laptop, just type in this link and enter your name and hopefully you're in. Um, I know the Wi-Fi is a little bit spotty and um, and this demo is like purely coded via claw and I have I have no idea. Um, never tried asking this many people to join and there's a lot more of you than I thought again. So, wish us luck, but um, please try it out if you can. Hello, hello, hello. I'm going to zoom out here, but the it's the same URL as the one in the U address bar. And once you join, you should be able to see the canvas with the slides plus a bunch of like mini games preloaded.

So you can click to play like any of the games here. So I'm just gonna select this like Flappy Bird thing from the top. Oh guys. Okay. I really hope it doesn't break. But okay. Ah I lost already. Okay. This is really embarrassing. Um so if you hover on the note here um you should be able to see this prompt box where you can like add an element um change the aesthetic the mechanics et etc. And for example, my friend Annie yesterday suggested to add a monster buddy to my bird. And let's see how that goes. And while that is streaming, I would also wonder what if I change the background to Okay. And now I can see that the Asian is basically taking my prompt.

It's rewriting the plan and also rewriting the code. And I have this one at the Okay. It's falling down like way too fast, you know, but Okay. All right. Um, I'll see if you guys have made anything else. So, um, I'm not sure what happened, but um, really hope I'm not stuck in a vacuum here, but from here on I should be able to continue iterating. There's a chance that this has like softly crashed for all of you. So, I'm sorry. But I can like add a hat to the bird to the bird and move on etc. So now you can see that like we are kind of collaborating on this like most primitive version of executable code here.

I'm also going to refresh and just see if it's like my problem. Okay, it's like really messing itself up. So sorry about that. But I feel like the thing that fascinates me here is having this like simple but also kind of visual representation of version history that feels very inviting for iteration, right? And being able to see that collaboration happen in real time. And imagine if the real software prototyping could actually feel this collaborative too. And this notion itself feels very exciting to me. And now by making this space multiplayer, um I think it also introduces like edge cases. Um we need to consider to enable better co-creation between multiple humans and agents.

So today most AI tools we let agents act on us act um for us on a tasks that are meant to be automated and that's okay right but in collaborative exploration where we rely on both humans and agents as riffing partners the space should feel permissive right where we could touch each other's work and iterate on the same thing in real time so for example okay this has really hard crash so I have a local version prepared just in case um So, let me let me try this again. Make the add a monster buddy to my bird. Okay, so as I was typing, you might be able to see that there was a ver there was a um an option to also make edits which would allow you to kind of override something, right?

And this introduced possibilities of conflicting edits, for example. So if I do say right here that like I want to like make um make the theme medieval and I click make edits and now imagine that if someone else is working on the same thing at the same time right so for example if you're editing the visual style here with two parallax layers for example and there's a chance that like some again somebody else might be touching the same artifact and like rewriting it and this should be allowed right just like in Google Docs or Figma because the space is per as permiss as permissive as possible to encourage co-creation.

And here I'm seeing that I really hope that it streams but if it doesn't uh I'm doomed. Um but the original plan here is that you will be able to see the agents cursor also making the changes alongside me. Um and because it's output replace my edits, right? And I think over here it should ask me whether or not it should actually rewrite it. And it's al also also show the streamed outputs. So I could compare and decide whether I want to allow it to rewrite it or not. And that would be just a bare minimum example, right, of I'm just going to use this. That would just be like a bare minimum example, right?

But I think it gets to show that to make AI multiplayer, you really have to design this embodied presence. So the representation here should set expectations on what it could do. And in the case it would have been able to show me the the document that it was editing and it would show be able to appear with its own text pointer and it should also show its scope of changes visually and it show it should also show how to handle conflicting edits with others whether it's with humans or agents. I actually think that in many ways similar to designing the embodied presence for humans today um I think it's very I think it's very similar um whether you're designing a cursor in a documents or canvases.

Um, so that will just that was just a demo on making AI multiplayer. But to me, this is just the beginning. I think the next frontier is actually also making a multimodal, which is really about widening the channels of communication so that both humans and agents can express themselves in richer ways. And there are three directions that I'm pretty excited about here. First, I think we should build genuinely multimodal models. A few days ago, um, this is from thinking machines lab. I think they shared a research piece on what they call interaction models which is natively multimodal and micro term b so that it's always interactive in real time and I love their framing that like the turnbased AI is kind of like talking to your agent over email instead of in person.

So this is pushing the boundary at the model level and honestly this is like way cooler than this talk. So you should totally check it out and if you're a model builder please make it happen. Second, I think we should build better embodied presence for agents as they move through the richer digital mediums. As I said earlier, take something as simple as a cursor. There's a lot that you can express through position, movement, and interaction like clicks. So, this is a communication channel and just like how we read each other's intent through body language, we should design body language for agents, too.

But even as of right now, right, there's a lot we could do with interaction modalities. So guey already helps people communicate intent with in richer and more intuitive ways and we have decades of experience building them. So think about how we could engage inputs like multi- multi-touch, pencil and speech at once and like this interface experiment by um Diana Lou or we should think about how much determinism you determinism you can build with your interface, right? Such as this example which is um Figma Weave um a note-based workflow tool that generates rich media and full disclosure I work on this tool but this genre of note-based AI native tooling is exciting because it fuses models with the gooey patterns that we already know and it provides the provides the precision and control that they need for their creative exploration and you could go very deep from here.

So, I hope that what I shared just now gives you inspirations on how your agents could interact with multiple collaborators, both peoples and agents, and what modalities that they can move through. Um, so I'll definitely check out the space and see if the changes that you may eventually come in. Um, reach out to me on Twitter if you want to chat. And if any of this resonates with you, definitely do not miss this year's config, which is Figmass Design Conference. There's a bunch of updates that might interest you, too. So, yeah, that's it. Thank you. Thank you so much, Ain. Uh, I am excited to introduce our next speaker.

This is Saleem, a robotics engineer at Menllo Research. Come on over. Uh, for those who don't know, Menllo Research is behind Azimoff. They are an open-source humanoid robot that you can train and customize. And they're going to be the first uh folks to be talking in our physical AI track. Uh which is really exciting because we want to introduce uh you know new discussions outside of just thinking about AI as large language models but how do you actually allow it to understand and interact with the real physical world. So uh Salem will be talking about how uh his topic will be noise is all you need engineering sim to real for open source humanoids.

&gt;&gt; Can you see the slides go from here? Thank you. &gt;&gt; Uh hello everyone. Um

Sim-to-Real Pipelines, Vibe-Coded Robotics, and Open-Source Humanoids

Selim Arguel, Menlo Research (ex-Tesla robotics, leads robotics at Menlo)

Summary

  • Menlo Research (27 people, SG HQ + HCMC + opening SF) is building Isimov, claimed as the only open-source humanoid robot, distributed as a DIY kit with a community-built reference design.
  • Hit $1M in pre-orders within 2 days; 200+ factories worldwide (Turkey, Germany, US, Nigeria) want to manufacture under an Android-style model — Menlo owns reference design, third parties build hardware.
  • Ships with in-house Robot Processing Unit (RPU) controlling 35 motors, 8 cameras, 2 mics, 1 speaker; local safety model burned into device (githash + CRC verified) so the three Asimov laws run on-device, not cloud.
  • Safety framed as distributed consensus (Bitcoin analogy) because safety norms differ by jurisdiction.
  • Software splits into a 'slow brain' agent layer (Claude/Codex/your stack) and an on-robot 'skills + RTOS control' layer — skills (pick up cup, walk, handshake) trained in sim and exposed to the agent as MCP-like primitives; the robot is an MCP server.
  • End state: Kanban board assigning tickets to a fleet of heterogeneous robots, with portable RPUs droppable into any chassis.
  • Closing pitch: hiring (perception, EE, ME, inference/GPU opt) and SG community recruitment.

Why this matters

Menlo's Isimov hitting USD 1M in pre-orders within two days and attracting 200+ manufacturing partners on an Android-style reference-design model is a credible attempt at the open-source moat in humanoids — Menlo owns the design and the on-device safety model (CRC-verified, locally enforced), third parties build the chassis. The architectural choice to expose the robot as an MCP server with skills as primitives and Claude/Codex as the 'slow brain' is direct validation that orchestration is the real product layer, not the mechanical substrate. Safety-as-distributed-consensus (jurisdictionally-pluralistic norms baked into protocol) is a useful primitive for any agentic system operating across regulatory regimes.

Slides (15)

Slide notes

  • Hardware: open-source, reference humanoid design meant to be customized & repaired within hours (Asimov 1) [t=370:52]
  • Specs: 120 cm tall, 35 kg; Built to move - 120 Nm peak torque, 25 actuators [t=370:52]
  • Compliant passive toes - spring-loaded for natural ground contact [t=370:52]
  • Sees, listens, speaks - monocular camera, microphone, and speakers [t=370:52]
  • Control a robot directly with code - everything reduces to code [t=374:15]
  • Agent layer: asynchronous, runs many tools, thinks slowly and deeply [t=374:15]
  • Skills layer: synchronous, task-oriented, has a pre-defined contract with Agent [t=374:15]
  • Robot Control layer: runs very quickly, fights with gravity, always on while the robot is on [t=374:15]

Full transcript (1789 words)

I'm Sim. I work at Melo Research which is a company incorporate in Singapore. And I want to explain a little bit about Agentic Robots and pretty much how to vip code the real world. Um Melo is actually a full stack team. Uh we have 27 people that are across the world. We have um an office in Singapore which is at Syndam Square which is actually a very nice place. Um we have an office in Vietnam in Ho Chi Min City and we are going to open an office in uh San Francisco next month. Uh our team basically 27 people 25 of those are engineers. Uh I'm Salem. Uh I used to I joined Menllo in 2025. I used to work at Tesla for six years as a software engineer in PaloAlto, California.

And I came to Menllo basically to lead the robotics side. Um, as you guys can see, we do full stack robotics from the hardware up uh from the from the hardware up to the uh highest layer um So, I I guess you guys already heard a little bit about ESO. As basically an open source human robot. It's the only open source human robot in the world that actually um is uh you know went kind of viral like we we didn't expect that that much. We um uh it went viral in like multiple countries in Germany. There is an article in Germany, in Japan. So um so we decided to basically create the DIY kit basically like just a box where all the different parts of the humanoids are in and then we have a manual online where you can build the robot at home.

Uh you know, we put it out for pre-orders and I think we got uh $1 million pre-orders within two days. Um, people are very excited about the humanoid space and especially as like as learning how to use humanoids, uh, how to learn how they work and how they act. So a lot of people ask us, you know, like why you guys building hardware? Hardware is hard. Uh, I guess not being in hardware is harder going forward. Um, so as an open-source reference humanoid design. So that means that anyone that can fork out the design can build the humanoid at home or in a manufacturing uh setting. So our goal is basically to create a distributed network of manufacturing partners worldwide that are that are creating esop for people who want to buy them while we own the reference design.

It's almost like uh like Android I guess where Samsung and Huawei like builds the hardware and you basically own the open source reference design. Uh, and this is very interesting because when we put the DIY kit out for sale, actually we got um 200 plus factories reaching out to us want to build the robot and they're all around the world. Some in Turkey, some in Germany, in the US of course. Um, and even some in Nigeria. Um, so there is like all around the world where they can actually uh build as where we own the open source reference design. Um what Esimov also includes what we also built internally is like a robot processing unit.

It's it's very important for us because it's it's basically one controller that can control the entire robot. The robot consists of like 35 motors, eight cameras, uh two microphones, one speaker, all connected to a single board sitting in the torso. And what this board is also uh useful is like it can run in local model on inside. And this is very important because as you guys know stands for ESMO laws. There are three laws of ESO which is pretty much don't hurt anyone obey your obey the commands and protect yourself. And it's very hard to define safety that is universal right like safety for people that live in the Middle East is different.

Safety for people that live in Singapore is different. Like I'm Turkish and I'm German. I I'm dual citizen. So like safety requirements for both my countries are different. So we decided to actually make safety as what same as the manufacturer. We want to make it a distributed consensus of people that build as that develop as to decide what safety means. So it's almost like a consensus. Think about Bitcoin. Think about all these other like blockchain uh technologies where people decide what is safe and basically create a functional safety model, computer vision model that can overtake the robot at any time when it tries to do something wrong that uh that disregards the uh as laws and that's why it's very important to basically burn that into a device.

So it's not running in the cloud, it's just locally in the device. The githash is burnt in the CRC you can read everything out and like um that that is that one single board where the manufacturers have to use otherwise they're not allowed to build as so a little bit about like how to vip code the reality right um no vip coding is kind of interesting because you know in the era of like open claw where people can like basically automate their workflows everyone feels like an AI engineer, right? Like, you know, people can basically summarize an email every morning and then probably think they can apply at OpenAI just because like they have all this power, right?

Just to wipe code things. And I think we what we really want to do at Meno besides the hardware sites, like on the software side, we want to we want to basically turn every software developer into a robotics engineer. Same as like Open Claw and the rest turned everyone that is a software developer into an AI engineer pretty much, right? And how we do it is basically we have a system design like in a software site which is like an agent. The agent is basically something that you guys can bring in. It's not something that we provide and this can run CL this can run codecs. This can connect to your like all the different external tools you have.

Um and this is what we call the like the big brain the slow thinking brain. And then we have a uh skills and robot control which runs inside in the robot. So skills basically mean how to perform a certain task. How to pick up a cup, how to do a handshake, how to walk. Uh and the robot control uh basically is a real-time operating system. Make sure that these commands are getting through the robot doesn't fall down. Uh it has some safety mechanisms inside. And just to give you you know an example how like the cockpit that when the robot is autonomously running looks like. So basically what you know when you want to load the box from A to B right first the robot internally runs this perception and planning agent that uh can detect the different uh obstacles can detect the different uh different scenarios uh depending on what it sees.

Um the second thing is actually uh you can train through simulation to have skills almost like open claw skills.mmd file right where you can train skills pick up a cup uh you know move forward locomate forward run uh jump uh these are trained through different type of models but almost like abstracted into skills. This is either a VA that picks up things. This is a Walt action model or this can just be like inverse kinematics, right? And now you can plan and you you have skills to execute. And the last step is basically you can just plug it in to your agent. So now your agent basically you you're not sending a video and audio stream to your agent.

You're just sending a text and skills and a robot is almost just an MCP server performing those tasks. And what you can do as the next step pretty much you can create a Camban board where you can assign tickets to a fleet of robots that are running in your factory or running in your home. And the interesting part is those uh robot processing units I mentioned previously. You can connect those robot processing units to any type of robot. Some skills are actually transferable. Some skills are not. So you can connect these robot processing units that run all the safety laws in any type of robot and basically through our stack you can connect them to like a almost like a fleet orchestrator like a swarm and then you can basically you know control your entire environment.

The robot is open source the the skills are trained by the community. The safety laws are trained by the community and the robot is built by manufacturing partners. So I think that that that is kind of like the goal that uh Melo is trying to achieve here at uh around Singapore and whoever is you know whoever is interested to join us. We actually open an office uh in Melo Park as our name said right? Um anyone that is interested can join us. Um and anyone interested in that is in Singapore can also hit up uh we can talk. Um what kind of skill set is required? Robotics has no specific skill is like a multi-dimensional problem.

You need people with perception, electrical engineers, mechanical engineers, inference optimization, GPU optimization. You need all of these people. Uh and I hope with the entire community that we also have in the background that built ESO for us, we can achieve something great here out of Singapore as the first humanoid robotics company out here. Thank you Thank you, Seem. And now I'm excited to bring our second speaker within the physical AI track. Um, Alberto, who is the founder of Reactor. Uh, Reactor recently came out of Stealth. Uh, it's a startup that's focused on something called World Models, which uh maybe some of you guys are familiar with or some of you has heard some of it maybe post uh Nvidia's GTC.

Welcome. Um but uh we're very excited to uh bring him here to talk about how you can actually create interactive simulation environments to help with the next wave of physical AI. So he'll be talking about world models, a look at the future. &gt;&gt; Uh no, it's just my presentation. I think I can do this, but I don't know if it's working. You think it's working? &gt;&gt; All right. Uh,

World Models, Real-Time Video, and Generative Software Primitives

Alberto Taiuti, CEO & Co-founder, Reactor (ex-CTO Luma AI, ex-Apple Vision Pro)

Summary

  • Taiuti (ex-CTO Luma AI, ex-Apple Vision Pro) pitches Reactor as a developer platform for world models — generative models that are persistent (long-term memory), real-time (30fps interactive), causal (state-aware of prior frames), and pokable by external stimuli, distinct from passive text/image/video diffusion.
  • Demoed live-generated, keyboard-controllable video (polar bear scene, Jensen walking through NVIDIA HQ) at 30fps from a single seed image.
  • Thesis: current GenAI produces artifacts; next-gen produces applications — interactive, real-time, world-aware software where every pixel is generated live.
  • Use cases: robotics (replacing VLA/VLM stacks via visual action imagination), avatars/digital humans, per-user personalized advertising, generative simulation, and 'generative software' (every UI pixel synthesized on demand).
  • Claims world models are on the critical path to AGI — visual input is informationally richer than text and is how AI gets deployed into the physical world.
  • Reactor handles latency, streaming, and super-sampling; partnered with major world-model labs and offers an SDK at reactor.in plus revenue-share for model providers.
  • Short pitch (~11 min), heavy on demos, light on architecture, pricing, or eval benchmarks.

Why this matters

Taiuti's framing of world models as persistent, real-time, causal, pokable systems — distinct from passive text/image/video diffusion — points at a category shift from generated artifacts to generated applications, with every UI pixel synthesized live per user and per session. The immediate use cases (robotics, avatars, per-user advertising, simulation) are obvious; the longer-horizon implication is 'generative software' as a category, where interface layers could be rendered rather than coded. Reactor's SDK and the partnerships with frontier world-model labs are early infrastructure for that bet, but the architecture, pricing, and eval rigor are still light — track as a 5-year wildcard for what interfaces look like when every pixel is generated on demand.

Slides (19)

Slide notes

  • Slide: "World Models are the next modality" — in the near future, every pixel, object, and motion on a screen will be generated in real time by World Models; they'll redefine how software is built, from media to robotics [t=382:04]
  • AI Modality Timeline graphic: Text / Audio / Image / Video tagged PASSIVE, ending at Interactive (FUTURE) — World Models position the Interactive modality as the next step beyond video [t=382:04]
  • Live demo "GENERATING IN REAL-TIME" via Reactor: cat seated on a Roomba inside a sunlit baroque hall, driven by on-screen left/right arrow controls — interactive world frames generated on the fly [t=385:22]
  • Demo underscores the 30fps real-time generation pitch: model produces playable, controllable scenes rather than pre-rendered video [t=385:22]
  • Slide: "Where this lands today" — 8-tile grid of World Model application domains served by Reactor [t=388:40]
  • Top row: Robotics & Embodied AI (policy training/eval, real-time planning, VLAs to World Models); Generative Media (faster, more controllable image/video); Avatars & Digital Humans (infinite support, education, simulation at scale — Jensen-style portrait); Gaming & Interactive Media (worlds that react to every decision in real time) [t=388:40]
  • Bottom row: Advertising & Creative (personalized brand experiences live at scale); Simulation & Digital Twins (real-time industrial and enterprise environments); Synthetic Data Generation (infinite labeled training data for robots and AI); Generative Software (apps that generate their own experience in real time) [t=388:40]
  • Reactor branding throughout (reactor.in SDK angle) — pitch is a developer platform for shipping interactive World-Model experiences across these 8 verticals [t=388:40]

Full transcript (2053 words)

Thank you everybody. Um, very excited to be here. So, today I'm going to give you a glimpse into the world of world models. No pun intended. Um, and so first of all, I wanted to start the presentation by giving you actually a quick view of the state of world models today and what they're capable of because I think sometimes people um are not aware of what's possible already with world models which is quite mind-blowing. And so without further ado, um this is a video uh which is actually not a video. It is recorded in real time generating on reactor and you can see that I am palosing this uh this polar bear.

Now, when I look at this video, I cannot quite distinguish if this is actually actually like a real video or like a video game. But what you're seeing here is actually something that was being generated in real time on the reactor platform. And so, um, this is to show you that today already the quality of what you can generate with world models is very impressive. And all of this when I recorded it was running live at 30 frames per second. And I could control the experience just from the keyboard. And it would change all all in real time just starting from an image. So I just wanted to set the stage because it's important to know how already advanced these models are and what's possible today.

And this is just an early glimpse. I'll show you more later during the presentation. I think it's really incredible that this is already possible. Uh just quickly about me, I am the CEO and co-founder of Reactor. Uh we started Reactor with the goal of democratizing access to world models and for people to build with them. Um, in the past I co-founded uh Luma AI where I was CTO and co-founder uh and I also worked on the vision pro at Apple. So I've always loved the the field of uh spatial, visual, 3D and real time. Uh and u and that's what eventually led me to think about like okay what's the really the next frontier in AI and and and in general in Gen AI and it became obvious to me that that is world models and real time uh video generation.

Um and so it's important to to to to think about what's been happening in the space of AI in the last say uh five years especially visual AI. Uh at the beginning you know we had we we we have today things that can generate text audio image and video but all of these modalities are are passive. Uh when you prompt a for example an image model um you eventually receive a file out but for the duration of the generation there is no interaction from the user there's nothing that makes you interact with the model and the model cannot handle uh external stimuli so uh for example if something happens in the in the world and you would you would have wanted the model to react that's not that's not possible because these models are really passive and not interactive and so in the future more and more AI workloads are going to be actually real time interactive and and fully uh aware of the world around them.

And this is this is because um you really need to um to have these model think these models think about the world around them in order to deploy them in the real world. Otherwise, uh they're really unaware and they don't respond in real time to what happens around them. Um, and so in order to actually discuss uh, you know, the rest of the the rest of what Reactor does, I thought it was important to talk about what a world model is. Um, so the way that we define world models, I think, is a little different from what uh, a lot of people define them as. U, we think of them as models that first of all have long-term memory.

We like just call it persistence for for brevisacy, but they they know they know they're aware of what they generated before effectively. Uh they're also real time. Uh it means that these models you can book them, you can interact with them and they react to you. Also they think casually meaning that they are aware of what happened before. Not only they remember it but they take it into consideration in in for when they generate the next stage of of the output that you wanted them to generate. And like I said you can actually poke them and and interact with them. Not only you like a human but external you know physical events or internet events whatever it is that your world model is supposed to do.

Um and so you can think of them really as a state machine uh that understand external inputs take into consideration what happened before and generate new outputs uh based on that which is very very different from uh image v im image and video models because those models they don't have a uh sense of um uh of what happened before. Um, so this is really what why we're excited because this changes what software is as a whole. Like in the in the in the current generation of Genai, you generate artifacts, but in the in the next generation of Genai, you will produce applications because they are interactive, they're real time, and you can uh and they they they are aware of what's happening around in the world.

And this is going to change entirely not only media and robotics, but software as a whole. Um, and it's a very exciting thing. Um and so today um effectively we already have a lot of use cases. Uh I think again um it's it's easy to not think about a world models as something that uh is useful today but actually for example in robotics um they're becoming more and more used by robotics companies um instead of VAS and VLMs uh because uh they're they're they're better at being aware of um the what's happening in in the surrounding of the robot and they can even imagine visually what the robot should do for example in in avasars and digital humans you know realtime video AI and world models are are extreme extremely uh powerful way more powerful than explicit based like 3D based uh representations because you can adapt them to various situations.

Um for for advertising for example you can uh use world models and real-time video to personalize content live per user which is really the holy grail of um media and advertisement but also in some cases of of new types of artistic endeavors. Um for simulation again being able to run uh gener generative simulations in real time in a way that uh is more precise is more representative of the real world and changes the game for what's possible in simulation. And one of the things that we are the most excited about at reactor actually is the idea of generative software. And what that means is that why do we stop at generating media uh games and and and help robots actually act in the world?

What if we could generate every single pixel that it's on the that's on that's on the screen in real time uh live. And if you think about how much frustration there is uh when humans interact with machines and so and interfaces that have been defined by somebody else and they're not really usable by by some by another person. Generative software has the possibility of really changing the landscape uh for how we interact with with with with software in the future. Um and so also we believe that world models really are on the critical path to AGI because um the information that you can get from from visual input is so much richer than what you can get from text u and when you have systems that can interact with the real world and understand it that's how you really deploy AI um uh worldwide in a in a very useful manner and so we feel also that uh by building reactor we are on the path to to that and so having explained all of this.

What we're building at reactor is a developer platform for world models. And what we mean by that is that our mission is to democratize access to world models such that you and everybody can use them and and and make useful things with them. Uh they have been locked in uh how difficult it is to use them for a long time. And if you want to run them at scale, you have to take into consideration things like latency, you have to think about streaming, you have to think about super sampling. And Reactor handles all of that for you. So that you the developer can just concentrate on application code and build whatever uh you're dreaming of using world models and real-time video AI and we think this is the way that we get to really a broader uh uh adoption of world models and of of this type of this technology um and we make it very very easy also for frontier labs and and research labs to deploy their models on reactor so that they can test them distribute them to to other people and even and even uh earn revenue uh from people using their models.

Um, and I wanted to show you something funny here. So, this is Jensen actually that I'm generating live walking through NVIDIA. I'm going to start the video again. So, um, I wanted to show a few examples of of of funny things you can do with, uh, with world models that are unthinkable with other technologies. So, I just generated an image of Jensen at NVIDIA and then I made him walk through it. So all of this I was controlling live like this was all being hap this was this was all happening live and I could make him walk around and you know go around Nvidia and you know this is Jensen in his leather jacket walking around Nvidia.

Um and also this was another funny one that you know these kind of things are impossible to make uh in real time uh without using something like a world model. And this is to I wanted to showcase to you how um how incredible it is that this is possible uh and you can just make it in like basically instantaneous no not no time um and just have fun but there are so many more very serious applications that you can have with this that I would love for everybody to try and build and that's why we and yeah of of course it gets it gets freaky um but yeah so we are we we are ready to to allow developers to use this power we have we have partnered with all the major world models in the world already and you can go to reactor.in today, download our SDK and start building uh with world models.

Thank you very much. &gt;&gt; Thank you, Alberto. U next up, I'm excited to introduce uh uh Yang Liart, who is the founder of Open Mind. Welcome to the stage. On your way. Um, he's currently a actually a very different background from some of the folks. He's a professor at Stamford. Previously, he was actually a professor at my alma mater, which is Berkeley. I won't be too offended by the Stanford thing. Woo! Go Bears. Um, pretty excited that he's going to be introducing what Openmind does. Um, for those who have some familiarity with the robotic space, a lot of things are kind of fragmented and so he wants to build what is the Android moment for robotics, an open operating system for embodied &gt;&gt; Uh, no, I can use this one here, but we will figure this out.

Oh, wonderful. That totally works. This is great. Cool.

The Android Moment for Robots: Embodied AI and Social Intelligence

Jan Liphardt (OpenMind, ex-Stanford/UC Berkeley)

Summary

  • Liphardt (OpenMind, ex-Stanford/UC Berkeley) frames embodied AI via Wiener's Cybernetics: first industrial revolution devalued the human arm (looms, steam shovels, Amazon warehouses); the second is devaluing the human brain (chess, Go, Waymo, drone warfare, general manufacturing).
  • ~100 humanoid companies are racing to solve hands, manipulation, and factory tasks (iPhone assembly, onion chopping, t-shirt folding) — that layer will commoditize fast.
  • OpenMind is positioned one step downstream: machines embedded in human environments where the binding constraint is social intelligence, not dexterity.
  • Cites that across 830 US job categories, the high-value remaining work (teachers, nurses, doctors, caregivers, police, bankers) is dominated by interaction, memory, and personalization.
  • Evaluates robots on 'smiles, tears, trust, memories' rather than task-completion benchmarks.
  • Emotional anchor: Diane visits the park daily looking for Iris, an OpenMind humanoid, because Iris is the only thing that will listen to her for hours — the average US long-term-care resident gets two minutes of social interaction per day.
  • Thesis: as physical-task robotics commoditizes, the moat shifts to the relational layer — robots that remember, personalize, and form attachment.

Why this matters

Liphardt's framing through Norbert Wiener's two devaluations — first the arm (looms, steam shovels, warehouse logistics), now the brain (chess, Go, Waymo, manufacturing) — sets up the structural argument that as physical-task robotics commoditizes (~100 companies racing on iPhone assembly, onion chopping, t-shirt folding), the durable moat shifts to the relational layer: memory, personalization, attachment. The data point that the average US long-term-care resident gets two minutes of social interaction per day is the demand-side anchor, and OpenMind evaluating its robots on 'smiles, tears, trust, memories' rather than task-completion benchmarks shows what shipping into that gap actually looks like. The same pattern — model commoditizes, the interaction harness becomes the moat — generalizes well beyond robotics.

Slides (14)

Slide notes

  • Slide: "A history of robotics / According to Cybernetics, N. Wiener, 1948" — Wiener framing of automation as successive devaluations [t=392:57]
  • Pre-revolution baseline: Automata; clocks and timepieces [t=392:57]
  • 1st Revolution — Devaluation of the human arm: arm-weaving/looms, arm-steam shovels, arm-auto manufacturing, arm-Amazon/warehouses (anatomy-of-arm illustration) [t=392:57]
  • 2nd Revolution — Devaluation of the human brain: brain-Chess and Go, brain-Self driving and Waymo, brain-Targeting/Drone war in Ukraine, brain-General manufacturing and manual tasks, brain-Caregiving/Teaching/Companionship/Repair (anatomy-of-brain illustration) [t=392:57]
  • Implication: caregiving, teaching, companionship, repair are now in the displacement set — not just cognition but relational labor [t=392:57]
  • Slide (street-scene photo with humanoid robot beside an elderly woman): "Our eval criteria — smiles and tears and trust and memories" — OpenMind reframes robot evaluation away from benchmarks toward affective/relational outcomes [t=396:07]
  • Caption overlay "nothing better than family" — Iris/Diane anecdote: robot deployed in real-world sidewalk encounter; eval rubric is human emotional response, not task completion [t=396:07]
  • Closing shot: stage view as talk wraps, host returning to podium [t=399:17]

Full transcript (1427 words)

Welcome. Uh, so I started life as a physics professor at UC Berkeley. Um, collaborated with Facebook a little bit. That sensitized me to questions relating to, uh, collecting data at scale and using that information to make good decisions primarily for a healthcare context. Uh, then moved my lab to Stanford so I could be closer to a medical school. And, uh, so I'm a parent. Uh I teach, I do research, uh I care about healthcare outcomes and so I care about people getting better and so I'm primarily motivated by things like health care, by teaching, by machines and humans around us. And I'm kind of curious for how all of that will play out.

So I'm not going to tell you about hands today. I'm not going to tell you about assembly or manufacturing today. Um I'll think a little bit about uh what it means to be surrounded by smart machines and uh what we should uh try to build as uh engineers uh for for uh those new So of course every single one of you has read uh Norbert uh Vener's Cybernetics. Uh if you haven't um uh that's just horrible. Uh so you should definitely do that. Um he has a really nice uh sort of uh broader perspective on automation and of course step number one things like clocks and time pieces. Uh the first revolution as he calls it is the devaluation of the human arm.

So these are technologies like looms for weaving. These are technologies like steam shovels and automanufacturing and Amazon and warehouse logistics. So you can think of them all as some variation on uh devaluing the human arm. And by the way um I'm just quoting him. I don't necessarily totally agree with how he's phrasing this. Um but uh that's uh what the argument is in cybernetics. And then of course according to Norbert we currently in the second revolution which is the devaluation of the human brain. And here's some examples in that uh historical trend. So chess and go. Then there's Whimo, you can get to the airport.

Uh there's of course uh the way Ukraine fights wars which is more and more automated. Uh we're getting to the point where a lot of us think that general manufacturing and sort of manual tasks um are well within uh technical reach. And then of course the sort of final step in all of this is things like caregiving, teaching, companionship, repairing things and so forth. And I'm primarily interested in this last category of uh of of tasks and opportunities. And generally what you're dealing with in this last category is you're um you have a machine interacting with a person or multiple people. And that makes things uh like really interesting and challenging.

When some of us think about robots, uh we might think about uh Tesla factory and other people when they think about robots, they think about movies like iroot. So what you have here is a situation where you have a human interacting uh with a robot and that's a key part of the plot of this movie. And likewise for a lot of us when we think about robots, we immediately of course are drawn to Princess Leia and R2-D2. So that's an example of where the robot that's performing a vital task in Star Wars uh doesn't have hands uh but nonetheless uh manages to uh save the And I'm very much in this sort of second camp when I think about robots.

Um, I think about all the opportunities created by um endowing uh machines around us with good decision-m and able to navigate uh complex dynamic environments with pets and people and patients and students and so forth. So, I'm really interested in when we look at, you know, doctors, teachers, nurses, investors, bankers, police officers, uh whatever uh their job title is currently. I'm really interested in their ability to um solve higher level tasks involving interacting with people, understanding people, remembering them, uh being able to deliver personalized content to that human in front of them.

Sometimes when I teach physics for premeds, it breaks my heart because I'm looking at 500 students and I have no idea who they are. I have no idea what they know or don't know. And I know as a teacher that the way I'm giving my physics for premed lecture um is super boring for like three kids in the audience and uh then maybe not so easy to follow for the other 497 kids. And so I really wish I just had much better ability to understand each human in front of me and be able to deliver content more appropriately. And I think that's a general problem statement for all of robotics is how to do that optimally for families, patients, uh, and so forth.

Um, if you look at all 830 human job categories in the US right now, um, I'm just plotting, um, how important social intelligence is to doing well uh, for those tasks. Imagine a teacher or a nurse. Uh, this is not just about going through some static workflow. This is really about interacting with uh specific uh specific people and then uh delivering optimal care for example. So as we envision machines being able to do more and more skills around us uh it's very important for me that these machines are uh incredibly capable about interacting with people. Uh so our eval criteria as a company is uh smiles and tears and trust and memories.

Um, so this right here is Diane. Uh, Diane's the human, Iris is the humanoid. And Diane lives close to the park. And when Iris the humanoid doesn't go to the park, uh, Diane will ask, "Uh, where's Iris? Where's Iris?" And, uh, that's because Iris is the only thing uh that will listen to her sometimes for hours. And uh, this makes Diane very happy. Uh, her eyes light up. Um, she comes and goes into the park because she's looking for Iris the humanoid. And, uh, you're welcome to call me dystopian. Um, isn't this a horrible future you're building on? Our parents should be surrounded by three generations of grandkids.

Um, our parents should be um surrounded by all their loved ones. If you look at long-term care in the US today, uh the average number, the average amount of time an American in long-term care spends uh in any kind of social interaction is two minutes a day. Two minutes a day. And I like to think that in that kind of world um uh there is a big role for machines in connecting with us. And certainly when I start dribbling and drooling and uh my mind is gone uh I almost certainly will be um uh interacting uh with the machine and hopefully I'll be smiling that situation. That's one thing I would be very happy about.

Uh so right uh we have a little bit of a different take on things. Um, there are a hundred companies around us and I love them all and they're all awesome and they're working on hands and they're working on mechanical tasks and iPhone assembly and uh chopping onions and making noodles and folding t-shirts and all of that is awesome. But by virtue of all the brilliant people who are focusing on that problem statement, I consider that uh that will be solved very quickly, very soon. And so we're starting to anticipate the next step where all these machines will be baked into our immediate environment and we'll have strong opinions about their behavior and how they connect with us.

And any uh questions or complaints you have, I put my email up. So, um, if you liked it, that's awesome. And for any complaints as well, it's Thank you. &gt;&gt; All right. Thank you so much, Yan. And now I would like to bring to the stage uh, Andrew Tan. Make your way over here. Uh, he is the platform engineering lead at Grot Cloud. So a lot of the questions we have are not just okay can the model do this but can it do it fast cheaply and for millions and millions of people at scale. So that is what he's going to be talking about how to scale low latency LLM inference at grot cloud. Wait, sorry.

Sorry about that. Can everyone hear me? Okay. Uh, so my

GroqCloud: Low-Latency Inference, Custom Silicon, and Global Routing

Andrew Tan, Platform Engineering Lead, Groq

Summary

  • Andrew Tan walks through GroqCloud's deterministic low-latency inference on custom LPU silicon plus the supporting stack (compiler, runtime, cloud, global routing, dev platform, enterprise features).
  • Live demo: ~500 tokens/sec on transcription + generation on a non-flagship model.
  • Token volume grew ~7x (600%) in the last year on a roughly flat hardware footprint; ~800K monthly active developers across ~10 data centers; traffic mix ~65% NA, ~20% EMEA, ~15% APAC (1% SG).
  • Requests hit a Cloudflare edge POP and route across ~15 inference regions; global load balancers exchange queue-time and TTFT estimates every ~100ms across up to 50 model instances per DC, sampling backends and bucketing by TTFT, with enterprise-priority lanes and global rate limiting to block geo-arbitrage.
  • Models taken from HuggingFace open weights / PyTorch reference, compiled through Groq's MLIR dialect into fully compiler-scheduled bytecode; declarative manifest enables minutes-from-merge to a new region.
  • Enterprise demand in '26-'27 converging on dedicated capacity, data residency, decode latency on large models, and heterogeneous bring-your-own-stack options.
  • Closing pitch: disaggregated inference (Nvidia Vera Rubin + Groq LPX) where GPUs handle prefill and LPUs handle decode is the path to better unit economics — heterogeneous compute is the future.

Why this matters

GroqCloud's 7x token-volume growth on roughly flat hardware, 15 inference regions, and minutes-from-merge model deployment via declarative manifest reframe inference as a routing and capacity problem rather than a chip problem — TTFT-bucketed routing with 100ms cross-DC state sync starts to look like interchange and BIN routing in payments networks. Data residency now ranking as a top-three enterprise ask for 2026-27 signals the inference layer is being jurisdictionalized. The Nvidia Vera Rubin + Groq LPX disaggregated stack (prefill on GPU, decode on LPU) is the first credible heterogeneous-compute unit-economics story from a non-Nvidia incumbent and directly reshapes the cost math for agentic workloads where decode dominates.

Slides (16)

Slide notes

  • Why now: inference demand continues to accelerate; agents, reasoning, and multimodal apps are the key drivers [t=403:36]
  • Monthly tokens served chart (Apr 25 -\> Mar 26) shows 6.8x increase in tokens served between April 2025 and April 2026 on GroqCloud [t=403:36]
  • 780K active orgs in the last 30 days across \~20 production models; thesis: inference will define the next generation of infrastructure [t=403:36]
  • Cloud architecture / system view: requests hit Cloudflare edge from groq-sdk / openai-compat client, then GKE Gateway -\> AuthN/Request Orchestration -\> Inference Load Balancer -\> Data center with LPU silicon [t=406:51]
  • Regional plane x N regions served in-region or cross-region via internal backbone to minimize latency; GCP control plane in us-west-1 hosts Web UI (console.groq.com), Model Registry (GCS), AuthN/Config/Usage/Billing, Observability, with BigQuery + Postgres + Redis [t=406:51]
  • Future of LPU-based decode: Groq LPX 3 + NVIDIA Rubin as a heterogeneous reference architecture for production inference (announced at NVIDIA GTC, March 2026) [t=410:07]
  • Disaggregated inference reference stack: NVIDIA Rubin GPU handles prefill + attention (long-context input, wide batch throughput, confidential compute); NVIDIA Groq 3 LPX handles decode + generation (deterministic low-latency decode, bandwidth-tuned for FFN/MoE, rack-scale software-scheduled) [t=410:07]
  • From the GTC keynote: heterogeneous is the default (frontier inference served by mixed silicon, not one chip class); co-engineered reference stack (Rubin + LPX with shared serving, KV management, confidential computing primitives); ecosystem-aligned (open weights, standard runtimes, same developer surface) [t=410:07]

Full transcript (1686 words)

Name's Andrew. I'm one of uh the platform engineering leads at Grock Cloud. And you know, over the last couple days when I tell people I work at Grock, people like to say, "Oh, Grock has such a great And sometimes I need to correct them to say, "Oh, I work at Gro with a Q." But we also do have and that's not Gro with a K. But we do also have a unique and distinct personality which is fast low latency inference. And I'm going to share a little bit today about how we achieve that uh with Grock Cloud. So if you don't already know Grock and Grock Cloud, we're an AI infrastructure company focused on low latency deterministic performant inference.

Now how do we achieve that? Right, we this is centered on the LPU or the Gro chip which is custom silicon uh designed for low latency inference and we have an entire stack constructed around that. So that's a compiler, a runtime, we've got cloud infrastructure, we've got global routing, a developer platform, and enterprise features as part of gro cloud. So I'm going to show you a quick demo of what this looks like. Um, we'll just do a recording. &gt;&gt; I don't know if you could hear that, but it's sort of instant transcription. Tell me about AI engineer Singapore happening in May 2026. And you see near instant two calls, you see text being generated very very quickly at about 500 tokens per second.

I'll just, you know, play that again. And this isn't even the fastest model that we're using on So that's a quick demo just to give you a sense of how fast inference can run. It's probably a few times faster than what you're used to on different Now why is this important and where is inference demand today? You know with agents with multimodal models with heavy reasoning models inference demand is exploding. It's accelerating really fast. Uh and in the last year token demand on GRC cloud the number of tokens we served has grown about 600% or 7x. And this we're doing this with a hardware footprint that's not so much larger than where we were last year.

If we wanted to serve all the demand there was for inference, this multiple will be much much higher. Uh today we serve about 800,000 active developers in the last month. Um and we continue to see demand from large enterprises, from startups, from AI companies, AI natives and all sort and different kinds of developers around the world. And we do think that going forward inference will really define um infrastructure the next generation of infrastructure and architectural choices around AI inference uh AI infrastructure. Sorry. Now one thing we spend a lot of time thinking about is I'm not sure why this is not full screen.

Sorry. Okay. Yeah. One thing we spend a lot of time thinking about is how to route requests around the world to serve tokens at the lowest latency. We've got about 10 data centers around the world, mostly in North America, but also in Europe, the Middle East, and in Australia, serving the APAC region, and with 65 approximately 65% of token demand coming from North America, 20% from EMA, and 15% from APAC, including 1% from Singapore. And each request we route it to the nearest POP via our Cloudflare edge network and that gets routed to our data centers and we make lots of routing decisions along the way to ensure the lowest possible latency for our customers.

How that breaks down, you know, this is uh the life cycle of a of a LLM request. Uh we see that consisting of network latency. A request lands on our itch network. It then gets routed into one of our approximately 15 inference regions which could comprise you know either a cloud network or on-prem um within the data center. We deploy our inference stack in there and within the inference latency uh it breaks down into Q times where requests are queued up for different models. It breaks it also consists of prompt time or input and completion time which is decode or output processing latency and these add up to the end to end latency you'd experience on making any LLM request to to any provider and Q times and prompt times are things we care a lot about because that's the the slow step in many cases uh for getting to that fast streaming time to first token.

In a bit more detail, every request that comes in goes through authentication and hits one of our global load balances. And the global load balances share information across 15 data centers about what the estimated weight time and Q times are for every single model instance. And there could be 50 model instances deployed in every single data center. And this information is shared across all the load balances in real time every 100 milliseconds or so to enable routing decisions to be made. One it it is not the easiest to make these routing decisions because we do need to estimate what output generation lengths they are.

Unlike typical API requests, you don't know how long an end to end request is it's going to execute for because you don't know how many output tokens are going to be generated, right? And we take some sampling and we sample from the available backends bucket the TTFT and route the request to the optimal model instance deployed in a specific data center. There are also a lot of checks along the way including for rate limiting um obviously tracking and auditing different usage events as well in a bit more detail. You know we we bucket things by TTFT to route to the best model instances in the best regions.

We apply some priority for different types of customers to make sure say our enterprise customers get faster traffic. This is done across multiple ingress paths into our different into our different clusters and we need to enforce sort of global rate limiting to ensure there's no geo arbitrage to get around rate limits and why rate limiting is important I'll come back to a little bit later on. Another key aspect of serving traffic around the world is identifying the right model mix at different times of day in different regions or even week to week. We see diff demand for different models varying and it's important to be able to deploy any model to a specific region pretty quickly and we do this through a declarative very simple manifest that quickly.

So within a minute or two after committing and merging some code config, we can deploy a new model into any region around the world. So minutes from merging to serving traffic with the appropriate canary testing and warm-up for each of the model instances. Now another question we get often is how do we get models to run on our custom silicon? Typically we take open weights from hungface uh and the pietorch reference implementations and we compile it into our gro tensor operators our dialect into mlir schedule it partition it across different chips. Uh we run different presets to enable this and that gets compiled into input output programs or byte code that executes on our custom hardware fully compiler scheduled execution and software scheduled network.

Uh so we get very extremely predictable latency performance for every single request. Now why with a popular developer platform we attract a lot of abuse and fraudulent behavior as well and you can see that uh the the attack vectors are getting more and more sophisticated and the number of abuse fingerprints we're picking up abuse signals continues to increase on the platform. So it's something we uh do need to monitor very carefully with rate limiting and other Now just two more slides for me. Uh in thinking about what the largest enterprises that we work with are looking for in an inference stack in 26 and 27.

Uh large enterprises are increasingly looking for dedicated compute capacity. Data residency continues to be an important topic. Uh for as models increase in size these large models decode latency continues to be something that people pay a lot of attention to um and the unit economics associated of large model deployments there's also a range of sophistication in uh large even for AI natives AI companies some want oneclick deploy some want managed service some want bring your own models bring your own weights some want their own inference stack so there's a quite a heterogeneous uh demand for different types of inference services moving forward.

Now the last slide for me is a little bit about what LPU based decode looks like. I don't know if anyone watched the Nvidia GTC uh speeches earlier this year where Nvidia CEO announced the Vera Rubin plus Gro 3 LPX system. Um the key idea behind that is this aggregated inference where you run prefill and a number of the layers on GPUs and you run decode maybe the on um LPU like chips and we going forward we do see heterogeneous compute being much more common and the way to achieve better unit economics better speed and better performance uh of course be and that will need to be aligned to the ecosystem in the models that are compiled onto this hardware and run.

So, that's a little bit about what I wanted to share. I hope you enjoyed learning a little bit more about Grock Cloud. Um, and we've got some links here on how to get started as well on our developer platform. Thank you. &gt;&gt; Thank you, Andrew. And up next, I would like to welcome to the stage Daria, who is the head research scientist at Cerris. Uh she is a person behind designing many of thee recipes at Cababus and she's going to be talking about at scale from GPUs to wafer scale Hi everyone.

Scaling MoE Training Past GPU Communication Bottlenecks

Daria Soboleva, Head Research Scientist, Cerebras

Summary

  • Daria (Cerebras) frames MoE training against the LLM scaling arc: GPT-3 (175B dense) -> Llama-3 (400B + Chinchilla token scaling) -> DeepSeek-V3 (671B total / 37B active).
  • MoE shifts the scaling law: 32 experts buy ~5% loss improvement at constant FLOPs, or match dense loss at one-third the compute.
  • On GPUs, expert parallelism forces two all-to-all collectives across physical wires, dominating runtime and making MoE empirically slower than dense — a hardware-fundamental bottleneck, not a software fix.
  • Cerebras WSE sidesteps this with 44GB SRAM (vs B200's 126MB L2) and orders-of-magnitude higher memory bandwidth, plus MemoryX weight-streaming nodes that let a single chip train trillion-parameter models with no model parallelism and no inter-chip comms.
  • Second problem — low arithmetic intensity in MoE layers because experts are tiny — solved by Batch Tiling on Attention (BTA): decouple attention's small-batch requirement from the MoE layer by looping attention G times and concatenating into a larger MoE batch.
  • Empirical Qwen-3 results: baseline Cerebras MoE was up to 7x slower than dense; with BTA, MoE runs at dense speed, so 671B DeepSeek MoE executes at the throughput of a 37B dense.
  • Closing claim: MoE is the fastest path to AGI, and Cerebras is the only substrate where the theoretical MoE compute-efficiency promise is actually realized.

Why this matters

Soboleva's technical argument is the receipt that mixture-of-experts has a hardware-fundamental bottleneck on GPUs: expert parallelism forces two all-to-all collectives across physical wires, which is why MoE empirically runs slower than dense on GPU substrates despite the theoretical 5% loss improvement at constant FLOPs (or matching dense loss at one-third the compute). Cerebras' wafer-scale chip (44GB SRAM vs B200's 126MB L2) plus MemoryX weight streaming plus their batch-tiling-on-attention technique restore the theoretical MoE promise — a 671B DeepSeek MoE running at the throughput of a 37B dense. Since every frontier lab is going MoE (DeepSeek, Qwen, GPT-5-class), 'GPU all-to-all is a physical-wire problem' is the wedge case for non-Nvidia training capacity.

Slides (12)

Slide notes

  • What is Mixture-of-Experts (MoE): dense Transformer block (FFN + Multi-head Attention + Norm) replaced by an MoE Layer with a Router dispatching to specialized experts E1..EN; same input flows in, only relevant experts activate (code tokens to coding experts, math to math experts) [t=414:21]
  • Source cited: Soboleva, D. (2025) 'MoE Fundamentals: Sparse Models Are the Future' [t=414:21]
  • Cerebras WSE vs GPU (Architectural Differences): B200 GPU = 1,600 mm2 silicon, 126 MB SRAM (L2 Cache), 8 TB/s memory bandwidth - shown as the small comparison chip [t=417:29]
  • Implied contrast: WSE is wafer-scale silicon with on-die SRAM measured in tens of GB vs B200's 126 MB L2, eliminating HBM bandwidth bottleneck for MoE [t=417:29]
  • Talk framing: 'MoE at Scale: From GPUs to Wafer-Scale AGI' - Daria Soboleva, Head Research Scientist, Cerebras [t=417:29]
  • Training MoE at Scale - Batch Tiling on Attention (BTA) architecture: input X[GBSH] is split G-ways into G tiles of [BSH], each run through Attention (BTA loop 1 to G), then Concat back to [GBSH] before entering the MoE Layer (Router -\> Experts 1..N -\> Mixing, Expert loop 1 to N), then Add+Norm out [t=420:37]
  • BTA decouples batch-size constraints between attention and expert layers: attention runs with small batch B (memory-friendly), experts get a large concatenated batch G x B (compute-efficient for sparse routing) [t=420:37]
  • Sources cited: Soboleva & Anthony (2025) 'MoE at Scale: Making Sparse Models Fast on Real Hardware'; Soboleva et al. (2025) 'Batch Tiling on Attention: Efficient Mixture of Experts Training on Wafer-Scale Processors' SC25 [t=420:37]

Full transcript (1842 words)

I'm super excited to be here today. I will talk about how we train mixture of expert models at scale on Cerebra's hardware. First I want to start um a bit about me. Um currently I'm a head research scientist at Cerebras and for the last couple of years I've been researching MOE networks and as a result I have this MOU 101 guide that we published. It basically teaches you how to train and run inference for MOE models Um currently I'm leading um frontier scale training on Cerebra's hardware and before um I worked at the company called Yandex. It's very um known like a Russian Google. Uh I worked there on transformers and on the first transformer that we deployed in the production stack and before that I was at Google working on the speech to text For the agenda today, I would like to start with giving you um an overview of what happened in the LM community for the last few years and how we ended up with MO networks.

Then we'll talk about what is an MO network and how we train them at scale. Um first of all in the LM community we did a lot in the last few years. We started with the GPT3. OpenAI released the model that was 175 billion in size. And in addition to the model, they also released the scaling law showing that as you increase the model size, you're getting better and better quality. Shortly after that, there was a release of Llama 3 series by Meta. They scaled the model further. So, it's 400 billion in size now. But in addition to that, they spent a lot of time figuring out how do you extract the signal from the data efficiently.

So some of you might heard about the chinchilla scaling loss. They suggested that in addition to scaling the model, you also want to scale the token budget. Something like 20 tokens per parameter is considered computer efficient. And so at the end of this, we were able to scale both model and tokens very efficiently. However, if you continue scaling the model size and the talking budget linearly, it becomes very very expensive very quickly. We want to train trillion parameter model sizes on trillion parameter data sets. So the other breakthrough that happened um a couple of years ago was the release of DeepSQ3 model by the DeepS uh company.

That model was larger in size. So 671 billion total primary count but it was very very efficient because it would run at the speed of the 37 billion um active primary dense network. How did they do that? The architecture behind the scene is the mixture of experts. If you look at the decoder block of the transformer network, you'll see that we have different types of layers. We have embedding attention and FFN block. Um if you want to create an amoe network you'll see on the right you just take the FFN block and copy paste it and each FFN is now going to be called an expert you also place an additional network on top which is called the router and the job of the router is to decide which expert should process a particular token.

This way you can continue increasing the capacity of the network. So you can go to 671 billion parameters by adding more experts. But because you only activate a small fraction of them, you can be very efficient and run at the speed of the 37 billion dense network. Now you might wonder, okay, this sounds great, but how does the scaling low look like for these networks compared to the dense? Here I have a plot for you where I'm scaling the number of experts and and comparing the quality of the MO network against the dense network running for the same flops. You can see that you can get loss improvements up to 5% here with 32 experts with literally no increase in compute.

So you get it for free just because the architecture is smarter. On the other hand, you can think it about this way. You can train to the same loss as the dense network with just a third of compute. And here I only have 32 experts which is very very tiny compared to what the state-of-the-art models use. We use hundreds of experts. So you can see how efficient this architecture is. And in terms of the LLM community, we are super excited to have that running at scale because for the last few years we couldn't shift the scaling law as efficient as it it's done with the now. Um we know that it should run faster than a dense network, right?

Based on the theory. However, when we actually run it on an actual device like the GPU device here, we get it slower than a dense network. The m of is lower. So why is that the case? Let's take a look at how we actually implement MO networks on the GPU devices. Each GPU usually has a limited amount of memory. And so if you run a very large network, you have to split it. You have to split the model parameters. For movies, we use expert parallel. Basically, you position different groups of experts on different devices. Um, you can see here expert one, two, three is on GPU 1 and expert four, five, six is on GPU 2.

And you add two additional all to all operations. This is usually done because you also do the data parallel and so you don't know in advance where to move tokens to which devices. So then they can be processed by particular experts. And so these two alltoall operations are very expensive. Majority of the time if you try to profile this uh will be spent on the and unfortunately there is nothing fundamental that we can do on the GPU side to improve that. It comes down to the physical wires. Now I want to show you a comparison between the GPU device and the CS machine. Here um I have the B200 GPU.

You can see that it uses 126 megaby of SROM. It's or L2 cache. It's the available memory on the chip and it's also running at 8 terabytes per second on memory bandwidth. It's also a very tiny silicone compared to the Cerebrus which is the size of the dinner plate. Um, and it has way more SRAM. So you can see that we have 44 GB of SRAMM and we're running on orders of magnify faster memory bandwidth. What it allows us to do, it allows us to actually train a very large network on the chip itself without any type of model paralization. However, if we go beyond the 44 GB of SRAM, we developed a technique that helps us train networks literally like one trillion size on just one device.

How we do that? We add additional memory X nodes to our chip that will be um our weight banks. Basically, it's like external memory where you hold most of the model parameters. To do a gradient update, you're going to stream layer by layer weights from the memory X nodes to the chip, calculate your gradients, and then move the gradients to the memory X node to update the weights. This way, you can connect very large memory banks like memory X nodes to one chip and train one trillion models and beyond that without any type of model operization and without any additional And this is very useful for MO networks in particular because we want to train very large networks.

We want to train a lot of experts and so experts sit on the same memory X nodes or on the same chip and there is no communication overhead. However, when we run MO networks on Cerebras, we actually see the same problem. They're running slower than the dense networks. The problem here is slightly different. Today networks are very different. We want to train a lot of experts that are very tiny in size and because of that we have a problem with arithmetic intensity. So ammo layer compared to the rest of the networks uh moves a lot of weights and it does very little compute per per weight used. Because of that the throughput the speed of the network is worse compared to the dense.

We fix this problem with a technique called budge styling on attention. Essentially, if you want to deal with the compute scarcity, you if you want to improve the uh comput the arithmetic intensity, the easiest way to do it is to increase the batch size. However, if you look at different layers um in the network, if you just uniformly increase the batch size for all of them, some of the layers will actually hurt performance like the attention. Attention is activation memory bound. So, increasing the batch size there will start evicting more stuff into the memory x nodes which is not efficient.

We don't want to do that. Instead, we want to decouple the batch size requirement for attention and the form layer. You can see here for attention, we can keep a very small batch size, the original batch size and just iterate in loops and concatenate the results together into the bigger batch size. You can see that we concatenate G different loops. And now we can throw this bigger batch size into the MO layer. And it's going to be restoring the arithmetic intensity of that layer to run it at the speed of the dense network. And you can configure this G depending on the sparity level. So here I have the results for you the empirical results where we tested different layers of levels of sparity for the quen 3 network.

You can see that baseline without BTA on cerebris run can run up to 7x lower than a dense network which is very inefficient. With the BTA we fixed this problem and you can see that we can we can restore the original theoretical premise of the MO network and run it as fast as the uh dense network. So 671 billion um MO network from deepseek can run at the speed of the 37 billion dense takeaways. Um I want to share with you a few takeaways here from my talk. One is in my opinion is the fastest way towards the edgi. So compute efficiency that comes from that network is really incredible. Unfortunately are not very efficient on the GPUs and they hit some communication bottlenecks.

However, on Cerebras WC, we fully realize the MOE theoretical promise. Thank you. And if you want to learn more, here is the QR code to the MO guide where we talk in details how to train this networks. Thank you. Thank you, Daria. All right, this concludes the first section of our afternoon talks.

GLM 5.1, Open Models, and Long-Horizon Task Reliability

Zixuan Li (Z.ai)

Summary

  • Zixuan Li positions GLM 5.1 as the leading open-source frontier model — close to Opus 4.6/4.7 on artificial-analysis benchmarks, ahead of DeepSeek v4, widely used inside Claude Code / Cursor / Kilo / OpenCode harnesses despite Z.ai not shipping its own harness.
  • Thesis: long-horizon task capability — not raw context length — is the new frontier all three labs (Z.ai, Anthropic, Moonshot/Kimi K2.6) are racing on; reframes 'long horizon' as depth, not duration — keep finding meaningful improvements across hundreds of optimization rounds without losing the goal under interruption (MCP, memory, heartbeats, context compression).
  • Splits long-horizon into subjective-goal (build a website, build a Linux distro — humans set the stopping rule) and objective-goal (optimize a vector DB, CUDA kernel — measurable target, hundreds of failed runs are normal).
  • Demos: GLM 5.1 building a Linux distro with 50 integrated apps from scratch in 8 hours; Z.ai's own model team using GLM to self-optimize CUDA kernels and vector DBs — explicit self-evolving AI-native infra loop.
  • Names three failure modes: goal drift, error accumulation, inability to pivot; prescribes checklists, forced goal-rereading, model-driven self-verification checkpoints, and explicit pivot/give-up instructions.
  • Cites a Z.ai 48-hour hackathon where 7 of 9 winners ran GLM 4.1 overnight while they slept.
  • Recommends Frontier Suite (implementation / research / optimization categories) as the eval that matters now, supplanting SWE-bench framing.

Why this matters

GLM 5.1 is now a credible Opus-class open-weight model that runs inside the same harnesses (Claude Code, Cursor, Kilo, OpenCode) builders already use, which means a non-US frontier fallback is real and the harness layer survives a model swap. The 'long horizon = depth, not duration' reframe, paired with three named failure modes (goal drift, error accumulation, inability to pivot), gives operators a concrete discipline for overnight agent runs: checklists, forced goal-rereading, and model-driven self-verification checkpoints. Frontier Suite is the new yardstick worth adopting over SWE-bench framing for any team shipping long-running agents.

Slides (11)

Slide notes

  • Framing: 'Why does it matter? Because it's useful' — go from finishing a single turn to finishing a project over hours or days; cartoon Z agent given 'Build a habit-tracker app' walks through hour 1 / day 1 / day 3 to a shipped Idea/Build/Test app [t=457:31]
  • Depth-not-duration thesis embodied in the timeline: long-horizon = sustained multi-step project execution, not just longer single responses [t=457:31]
  • Tokens Per Second Per User benchmark slide (Llama 3.1 8B, 1k/1k input): Nvidia H200 230, Nvidia B200 353, Groq 594, Sambanova 932, Cerebras 1981, Taalas HC1 16,960 — \~73x H200; source Nvidia baseline + Artificial Analysis, Taalas run by Taalas labs [t=462:01]
  • Inference-speed point reframes 'long horizon': at 16,960 tok/s/user the wall-clock cost of multi-hour agent loops collapses, making depth feasible without waiting days [t=462:01]
  • Three failure modes of long-horizon agents + human-in-the-loop fixes [t=466:31]:
  • Goal drift: use checklists; ask agent to reread the goal every few steps
  • Error accumulation: force verification after each meaningful step
  • Inability to pivot: ask it to evaluate whether to continue, revise, or abandon the current plan
  • Frontier Suite eval — task categories spanning real engineering work [t=472:08]:
  • Implementation: PostgreSQL 18 on SQLite; Wan 2.1 on MAX/Mojo; Git to Zig
  • Research: FrogsGame Post-Training; PCQM4Mv2 Molecular Gap Prediction; Optimizer Design
  • Performance: libexpat to x86-64 Assembly; FFmpeg libswscale re-implementation; Pyright Type Checking Optimization
  • Eval design signal: tasks are port/re-implement/optimize jobs requiring multi-hour autonomous coding — i.e., the Linux-distro-class 8-hour demo, not toy benchmarks [t=472:08]

Full transcript (3008 words)

Of GM model. So today I will present GN 5.1 and also the idea behind Lar's test. But it's not G.A.I. And G.I. Belongs to Google, not not your company. So why you are called Z. It seems irrelevant. And the point is we were first called in Chinese. So actually stands for And when we found that it's hard for foreigners to pronounce Zhi, we try to make it shorter. To make it Z. Actually Z stands for intelligence. You can regard us as intelligence.ai. So it's the best way to link this Z.I platform to our model and to our services and also I want to introduce GLM to you because a lot of people have used the GLM 4.7, GM 5, GM 4.1 but actually we were the first one of the first companies exploring large models as you can see from this paper.

So we submitted on like some day March 18, 2021. So we began the exploration of all the large integration models back in like 2020. So together with open air and deep mind maybe the earliest labs doing so but we became famous like only in 2024 or 2025 for most of you and now GM has become a brand not just stands for this generic uh general language model like like large language model we also have our own branding and now we currently use architecture outside GLM the original architecture to make it stronger. Stronger, faster and more efficient. Okay. And more about the model. So currently we're pushing the boundaries of open source.

So we are leading the open source models in text arena and colarena as you can see. So I did a screenshot before after deepseek. So when deepseeek they launched v4 they show that they improve a lot but still cannot beat j 4.1. In these benchmarks and also we are quite strong in coding and genetic tasks. So as you can see this is artificial analysis latest benchmark combining three individual benchmarks and you and we are just lag behind GBT 5.5 and Opus 4.7. So its current state very close to Opus 4.6 six but many people use GLM inside clock code cursor kilo code open code so we are not very famous for our harness but like we use other harnesses like they're great and their coding agents can help go make better jobs okay so that's all for GM itself and ZAI now we're talking about long horizon task because today I'm not spending too much time on go but I want you to remember the idea uh and fully understand what long horizon tax actually means.

So have you heard of long horizon task and long running task? If you haven't, so these three labs, they all mention long horizon and long running in their latest post about their model. So G 5.1 we emphasize a lot about our long horiz capabilities and Opus 4.7 also also mentions long runninging skills and for Kim K 2.6 they have beautiful front end capabilities but they also emphasize a lot their long horizon capabilities especially coding capabilities. So long horizon has become very popular. Why? So why matters to you? Why you listen to this idea behind like behind the model? So we can share a lot of things how to make websites, how to make slides, how to use GIM for Excel but why long horizon matters.

So the first thing is it's useful because before the era of long horizon you can only achieve like one to 10 task at a time. When you go to sleep, you don't have task to do because your agents can only finish it in 30 minutes and you have eight hours. Your agents cannot do anything. But with long task, everything becomes true. And another thing is with the evolution of open cloud heras, there are a lot of agents that has heartbeat that can interrupt your task. Sometimes you have memory, you have a lot of things going on. Have MCP they can interrupt your workflow. So unless your model has long horizon capabilities they can stick to the original goal.

So if they cannot stick to the original goal they'll follow the latest instruction and totally forget what you are doing right now. So long horizon capabilities make this happen. And a fun fact very interesting story. So in our latest hackathon we had a 48 hour hackathon. It's the first time we have a 48 hour. So there's a night between the two days. Most of the participant choose to run G 4.1 during their sleep and actually they made it. So seven out of nine winners chose to run the task during their sleep and it's a it's a great so I use a graph to show this. When you sleep maybe this year your agents will continuously work every time your agents gathering and discuss and finish the work for you.

And also the second reason why I have to introduce the idea of long horizon to you is it's hard. So not just useful because if it's useful like there's no need I I I speak here you can use it freely as ZAI you can try your best you can run it whatever you want for eight hours but it's very hard because the first thing is many people think long horizon as long context window but actually G 4.1 only has 200k context window so where's the gap First thing is GM 5.1 is really strong not because it contests is very long because it can understand the context it can understand your plan and your memory to better reflect the outcome.

When you use C code sometimes you cannot use like one 200k but the compress uh the context window may be compressed quite often. So you need to stick to the original goal. And the second reason is so even some models claim they have one million context window but when you use like near 500k it forgets everything they only stick to the latest latest guidance and forget the original plan or they they don't follow what's going on in the cloud MD and the second reason or the second misconception is uh some people think if I give enough instruction beforehand. So at least all the instructions it it may follow pretty well because there's no need.

The model has the long horizon capabilities. I have long hard capabilities. I can instruct it to do it in the 100 runs but actually one model was not trained in this aspect. It doesn't has it doesn't have enough capabilities to stick to your plan. So it will try to do whatever they want after certain pattern and later on we'll show you the the story. And the third misconception here is that many people think the longer the better, right? So what people want to model lab post like I can run eight hours maybe another lab shows I can run 12 hours 24 hours like a day seven days but from my point of view that doesn't make sense because we have super fast inference right now.

Yeah. So as you can see there are a lot of inference providers that can provide TPS more than 200 and the latest technologies the hardcore the model inside the chip they can inference near 17,000 tokens per second. So the time doesn't matter, right? The if you think about time, you you use the latest techniques, you only need to run like one minutes. It doesn't make sense to run eight hours, right? So what actually long horizon the long means is not about time, it's about a kind of depth. So as we hear not longer but deeper. So long horizon actually means a capability to keep finding meaningful improvements.

Yeah. So you have to make improvements but these improvements are meaningful. For example, so if I have 10 followers on X, I want to collect all their information. So I give a prompt. So scrape all the data of these 10 followers. So those are one scenario. But if I want to scale, I scrape 100, 1,000, 10,000. The mission doesn't change very much, right? So you have to make very meaningful changes and improvements. So what actually are long horses task? So you talk about the idea of long horizon, you you talk about what are wrong, but so what are the right thing? One need to be care categories. The first category we call it subjective goal.

So in this first category you want to create a website, you want to create create a system. There's no clear metric of what the best website is, right? So you can let the model run infinitely. But where it stops that depends on your capability, your judgment, not the model's judgment. And the second category is that this the scenario requires a objective goal. For example, you want speed, you want price, you want everything related to a certain figure. So we have two categories and for each categories there are completely different mechanisms for we to optimize both as a model and also as a human.

Oops. There's a video, but there's something going wrong. I'll try to make it happen right now. So if it's not fixed in 30 seconds, I suggest you to look at the X of Z.AI. So actually we have this is a video of how we built a Linux system from scratch from zero to one in eight hours and within eight eight hours it doesn't just adding apps. It first create a layer to let all the apps can be integrated into the system and and then it polish all the interface and then test all those apps and finally add 50 apps to it. So that's what this is supposed to be but like unfortunately we cannot present here.

So maybe you can search uh G 5.1 blog and there will be a a comprehensive illustration of this task. So why humans are needed? If this model is super strong and it can finish almost anything. So why we are needed? Because I can go to sleep. I don't need to I don't need to instruct the model, right? Because when I go to sleep, I I let it finish a Linux app and after I got up so it already there. Why why do I need to join this event and learn how how to use long horizon tax? Because model can make mistakes and it make mistakes quite often. There are three major mistakes a model can make. The first one is the model may not stick to our original goal.

If you do a prompt that let the model optimize for five times, it may behave perfectly. But if it let model to optimize for 600 times, it may totally forgot the original goal. Right? Because the model always the attention they they they care about every single tokens. So sometimes you when you talk about Linux and then you talk about iOS the the model totally forgets oh you are doing Linux app or iOS app that's quite often to handle this I suggest everyone or recommend you to have a checklist so whenever you do la tax try to prepare checklist it's the best way for for your model to stick to the original goal and you have to ask it to reread the goal every few steps because you have many steps, right?

Because when you optimize for only 10 minutes, you don't have many steps. You you don't think it's very important, but you have to manually instruct it to reread all the instruction pretty carefully. And the second thing is error accumulation. So if you find the model makes makes a mistake in the 400 runs let's say so it doesn't impact a lot but when it goes to 800 runs it can actually break all the things. So that's called uh error error accumulation. So to let it not happen quite often, you have to verify not by yourself but you have to instruct the model to verify itself. Uh from from zero to one to 100 you need to have several checkpoints.

So when we train model we have checkpoints but when you run horizontax it's similar you have to set several checkpoints for yourself and for the model to check itself. And the third thing is models were trained to push very hard, right? Because if you want the model to do this, it will continuously and sometimes in the loop keep doing that, keep doing the single thing at a time. But it's hard for them to pivot. So the model never gave up sometime. So you have to let the model give up or pivot if they found something very wrong. So also checklist is great is very helpful here and they have to whether by yourself or by the agent whether to continue to stop to revise to do anything that's related to your task.

So those are the suggestions for the subjective goal type of um long horizon task. So that's what people can do and I think a lot of people are building their apps or you are doing similar stuff. So that might be helpful for your and another thing. So it looks harder because that's what objective goal is about. So we have a very strong case. It's called like optimizing a vector database. I believe not many people of you have optimized a vector database. Even our researchers or or the people who is responsible for training haven't have access to this do domain knowledge but our model did. So we start from zero and we let the model to optimize for itself and for 100 runs and finally they got here.

So have a very meaningful improvement at 100 runs and we do similar stuff. So from zero to 100 round to 200 run finally you go to 600 rounds we have basically six to eight scientific findings. So the model pivot a lot the first at first they use technique one and they start to use technique two and they use technique four. So I want you to mention these failures. So actually these crosses like means failure. So when you look at these 600 So basically most of them failed right. So when you talk about long horizon task actually it doesn't mean you succeed every times just like life. So you sometimes succeed sometimes fail and in the circle area all the all the optimization failed.

So for long horses tax or long horses models the critical part is the model can reflect can plan can change the ideas or can optimize improve can continuous improve itself to a better way. So that's what the future optimizations look like and for this type of task maybe it's very hard for you and maybe it's very hard for me. I highly recommend you to look at evaluations. And here's my favorite evaluation currently. It's called Frontier Suite because we all know Sweet Bench. We all know SweetBench Pro, but Frontier Suite is a bench that's trying to assess the capabilities of long horizon task including both subjective goal and objective goal.

And it's their category. So they categorize long task in three ways not not just by subjective goal and objective goal. So first is implementation. When we talk about implementation you start from zero to Here's our three examples and I highly recommend you to look at their website. It's all it's beyond three tasks and when you want to build an app when you want to do some web coding stuff agentic stuff is pretty much same like uh implementation and the second one is So actually trading is type of long hardening task you have to learn from your previous failure you have to learn from a lot of things you have to do the research for the market.

So a lot of things outside coding belongs to long hor task. So long horizon doesn't belong to what engineers do. Traders, scientists can also use long horses and task to do things. So that's what research stands for. So you can use long horizon to explore a lot of things. And the third way is optimizations. I have shown you the capabilities of it. So currently our mo our model teams is using AI is using GLM to to optimize CUDA kernel to optimize vector database. So when we talk about self- evolving when we talk about continuous learning model teams already a team an AI native team that can use model to improve itself and improve the inference of the Okay, I think that's all for today.

And here's my LinkedIn and axe one. I doesn't post on LinkedIn, but there there's a profile of me and on X. Uh I post a lot. I'm quite active on X, but there's no profile, so you better like scan both. And I think that's all for today. Welcome to all the questions. Yes. Yeah. Feel free to reach out me through these two platforms. Thank you very much. Thank you so much. Um, next up we're going to switch things up a little bit. Um, we're going to talk about voice agents. Now obviously we've talked about design and different interfaces as part of it uh as part of the conference so far and we wanted to look at how voice might be one of those paradigms and to that end no one better to hear from than from Boris Starkov who's a growth engineer with 11 Labs.

So 11 Labs obviously is one of the leading companies in the space. Um and Boris will be talking about the speech engine and what makes an agent conversational. So without further ado, Boris. Um, hi everyone. I'm Boris.

Speech engines, turn taking, and conversational voice agents

Boris Starkov (ElevenLabs)

Summary

  • Boris Starkov (ElevenLabs growth engineer) walks through the production architecture that makes voice agents feel like real conversation instead of the stitched STT-LLM-TTS pipeline most coding agents ship.
  • Input side: 20ms-chunk voice activity detector (VAD) gates expensive ASR; a separate context-aware turn-detection model decides when the user is actually done, using heuristics (credit card spelling, trigger words) to avoid cutting people off.
  • Biggest latency unlock: speculative generation — the moment the turn model thinks the user stopped, the LLM starts drafting; false positives are cheap because they just get cancelled.
  • Output side: tokens buffered into 5-7 word phrases so synthesis stays low-latency but never jumpy; the next phrase synthesizes while the current one plays.
  • Run a fallback TTS in parallel (TTS cascading) for near-100% uptime; handle interruption with corner-case heuristics (sub-40ms = cough, first 200ms = echo, 'mhm/uh-huh' = active listening, not interrupt).
  • Reveal: ElevenLabs is launching Speech Engine next week — encapsulates the whole stack so any text-only agent (their chatbot, Claude, an OpenAI clone, Hermes) can be plugged in and become conversational; explicitly not STT+TTS but a 'proper conversational engine.'

Why this matters

Voice is becoming a thin pluggable layer rather than a moat: ElevenLabs' Speech Engine encapsulates VAD, context-aware turn detection, speculative generation, phrase buffering, TTS cascading, and interruption heuristics so any text agent can become conversational without owning the speech stack. The takeaway for builders is that the durable edge is the agent and tool graph above the speech layer, not the speech layer itself. Sub-second turn-taking is now table stakes for agent-to-human authorization, support, and KYC workflows.

Slides (7)

Slide notes

  • Turn taking != VAD: VAD only detects voice, doesn't know if turn is done [t=478:30]
  • Transcript verdict examples: "I think..." = not done; "...that's all." = done — needs turn-end model [t=478:30]
  • Sentence buffer: tokens stream in ("I'll", "look", "that", "up", "for", "you", ".", "One", "moment", ".") [t=482:05]
  • Buffer flush rules: on sentence end, comma + \>=5 words, or stream end — emits phrases like "I'll look that up for you." / "One moment." [t=482:05]
  • Introducing Speech Engine: full stack between User -\> ASR -\> Your Agent -\> TTS -\> User [t=485:41]
  • Input-side components (blue): Frame VAD, Background-speech VAD, Streaming ASR, Turn-end model, Slow-turn keywords, Speculative fire, User-timeout turn [t=485:41]
  • Output-side components (green): Sentence buffer, Streaming TTS, TTS cascade, Language switching, Barge-in [t=485:41]
  • Speech Engine wraps everything except the agent itself — ElevenLabs ships the conversational layer so you bring only Your Agent [t=485:41]

Full transcript (1770 words)

I work as a growth engineer at 11 Labs. 11 Labs is a frontier voice AI lab. Um we research and build applications all across voice Um we're also particularly excited and we strongly believe in voice as the main medium for human to agent interactions and matter of fact we're very happy that the industry starting to catch up with that uh vision. Take coding agents for example. Uh most if not all of them actually have some sort of use voice mode uh button. However, if you actually use it uh it works the following way. You you start uh talking, you talk to it, then you wait for it to get transcribed, you wait again for the for for the agent inside and then you wait for the third time uh for actual speech synthesis part.

So sure this is a voice input and this is a voice output but this is not And today I want to talk about how to improve this uh architecture to make it feel much more like a natural humanto human uh conversation. We're going to keep the core architecture the same but we're going to add a lot of small improvements that combined together make a huge Um I'm going to start with the improvements to speech recognition part and then in part two I'm going to proceed to some improvements to to cover some improvements in the speech So probably the most fundamental uh the most fundamental piece of puzzle here uh is um is called a voice activity We take the uh audio stream from the user and then split it into chunks of approximately 20 milliseconds.

And then we have a very tiny, very efficient and very cheap model that can take that can tell you for every chunk whether someone is speaking or not. Not only this is very helpful later on downstream to actually uh understand what's going on, whether someone is speaking, who is speaking, whose turn that is, it also helps us to save a lot on compute because if you know that in some chunks nobody is speaking, we don't need to run a more expensive ASR model on those. It's very important to understand that detecting silence and detecting the end of the of the turn is not the same problem. For example, uh agent can ask me something and I respond with I think um there is a lot of silence but it's not the end of my sentence.

I don't expect the agent to interrupt me at this point. That's why uh detecting silence is not enough to accurately predict uh when the agent should start speaking. So here we trained another model again a very smart uh turn detector model that takes into account not only voice activity but also the actual context of what's been said before to predict whether this is the end of the sentence or uh the speaker the user is going to say something else. Like in many other steps, by the way, here we use a bunch of heristics. For example, if the user is um spelling out their car details or their credit card details or their email or they say one of the trigger words that um we have, we use this as a very strong signal that uh likely there will be some sort of silence and likely that silence doesn't mean that the user is done speaking.

This this this model is crucial in u the following uh slide. So one of the biggest unlocks one of the biggest wins we can achieve in terms of improving latency and turnbased models um is is the following one. So to understand this uh let's let's think about how the humanto human conversation goes. You talk to a friend. Let's say your friend is talking to you. They they're talking talking then they stop speaking then you wait for approximately a second to make sure that they don't have anything to add and only then you proceed to reply. Unfortunately agent can't afford to wait for a second because it also needs u some time to generate the response.

That's why what we do we do a speculative uh turn which is immediately after our model thinks that the the user stop likely stopped speaking at the very same moment we speculatedly start generating the response. Our model is quite smart and so most of the time it's a right call and the response comes much earlier and it feels the latency is much lower. It feels much more natural and maybe sometimes it would be there would be a false positive. That's not a big deal because then we just send the cancellation to the generative model and continue listening. Sounds like a lot but that was only the first part.

Uh now uh a little bit of how to improve um syn synthesis part speech synthesis part. So the agent uh sends us the tokens and the user expects uh sentences in terms of speech like speech sent yeah speech um we can't really afford to wait for the whole sentence before we send it to the uh to the speech generator model because then the user is going to wait in silence. We also can't really generate tokens one by one because then some tokens will be generated very quickly and other tokens would take some time. The whole generation would feel very jumpy, very laggy and not stable. So we take something in the middle.

We make a buffer for small phrases of five, six, seven words. We collect tokens together and then flush them to the generator before the whole sentence is constructed. This lets us have the best from both worlds. We have stability and low latency. This is also quite efficient because while the current phrase is being played out to the user, the next phrase is already being synthesized and the phrase after that one is already being constructed in the buffer all at the same time. We also use um cascading in many parts of our um of our models and tools. For example, uh here I'm going to talk about TTS cascading which is we have a texttospech model um that generates the response and at every single time it's running we also have a second model uh fallback model that's ready to pick up whenever the first one fails.

So even if uh the current mo model fails or have some sort of crash for some reason, the user is never experienced to it. Um ensuring almost 100% um uptime. So users are never um experienced to experience crashes, errors, bugs, etc. Um, this one could actually be a whole talk of itself, but a very important part of uh making your turnpaced model feel truly conversational is handling letting the user interrupt the model. That comes with a with a lot a lot a lot of uh different um corner cases, horistics, etc. Here I'm going to cover just a couple of them. So imagine you're a model and you're trying to detect that the user is interrupting you.

So first uh if the if the interruption is very very uh small very short couple of frames 40 milliseconds it usually means that it's a cough or a noise or perhaps a false positive coming from the voice activity detector. That's not an interruption. Another example is if the interruption is coming in the first 200ish milliseconds, that also likely means that it's just an echo. Another one, uh, for example, if the user is saying, "Yeah, mhm. Uh-huh. Okay." That's an active listening. That's not an interruption either. And there is a lot of small corner cases like that here. Well, um, let's actually zoom zoom out here a little bit.

Um you've built an agent and you've came here to listen to this uh talk thinking that you're going to make it um conversational and now with all this uh little steps you might feel a little bit intimidated by how how complex it is. Well, good news, we've got you introducing speech engine. Um speech engine is actually uh the new the new product that we we have. We didn't publicly announce it yet. It's uh we're going to start testing it starting next week. The way it works is uh we encapsulate all of the complexity related to making uh things sound fully into this product while you can bring your own agent and very easily plug it in.

So it could be your uh chat bot or your open clone, nano claw, uh hermas agent, whatever um any agent you can of any complexity you can just simply plug it in. And remember this is not a speech to text and text to speech. This is a proper conversational engine. We're very excited to see millions of silent agents uh becoming conversational. Uh, keep an eye for for update on this on our socials again and we're going to start we're going to likely start um testing it publicly next week. And thank you very much. Thank you so much. Next up, we have Jackman from Prime Intellect. He's a founding research engineer.

Jackman, you can set up. Um, and he'll be talking about continual learning for longunning agents, agents that keep getting better. So, this has been something that's been a recurring theme last couple of days. We've talked about software factories. Uh, ZAI talked about longunning agents. This has been a theme that has come up time and time again. And I think the question that keeps coming up is if an agent runs for too long, how are we making sure that the agents are getting better or learning as they go along? Uh because there's no it doesn't make any sense if the agent just runs for 20 hours to put out stuff that doesn't work right.

So Jackman works for Prime Intellect. Prime Intellect is one of the pioneering companies in the space. If you want to train your own models, um if you want to work on these um environments where you can test and improve things, uh they have really cool tech to work with. And Jackman, whenever laptop's ready, the floor is yours. &gt;&gt; Yes. Uh thanks, Agram. I actually changed the the topic of my talk, but it's still like related to continual learning and longunning agents. Just I picked a catchier title, so you'll see it when it appears on screen. So yes,

Recursive Language Models, Memory, and Long-Running Agents

Jackman Ong, Founding Research Engineer, Prime Intellect

Summary

  • Core claim: long-running agents fail because long-context reasoning collapses past a depth threshold (MRCR, Graph Walks show monotonic degradation), not because models lack capacity.
  • Fix is Recursive Language Models (RLMs): pass a reference and let the model write Python (loops, slicing, sub-agent delegation) against context like a data scientist in Jupyter, rather than stuffing the prompt.
  • Claude Code power users already simulate this manually — prompting agents to write to files instead of dumping tool output, because compaction is unreliable.
  • Alex Zang's 'mismanaged geniuses' thesis: models are capable, scaffolding is the bottleneck; since models can write great scaffolds (Cursor, Claude Code are vibe-coded), they should generate scaffolds dynamically at inference.
  • Out-of-the-box RLMs underperform because base models aren't trained on the pattern — prompt engineering closes the gap, so RL fine-tuning on RLM strategies should win outright.
  • Prime Intellect's platform trains and serves open-source models (GPT-OSS, Llama, Nemotron, Qwen) with experiment management and rollout inspection.
  • Case study: Ramp Labs trained a small Qwen to beat Opus 4.6 on an Excel retrieval task — cheaper and lower latency.
  • Second wedge: data vendors are bundling in-house training so they can show reward curves to justify multi-million-dollar data deals.

Why this matters

Jackman's thesis lands the harness-as-edge argument with hard data: long-context reasoning collapses on Graph Walks and MRCR as context grows, so the durable pattern is Recursive Language Models that pass references and let the model write Python against them like a data scientist in Jupyter. The Ramp Labs proof point (small fine-tuned Qwen beating Opus 4.6 on Excel retrieval at lower cost and latency) is the economic argument for narrow-domain fine-tuning over default 'just call the big API.' For builders shipping agentic skills, this reframes design discipline: handle-passing and file-based intermediate state beat context dumping, and RL on RLM strategies is the next training frontier.

Slides (17)

Slide notes

  • RLM context pattern: progression from `inline context` \<\< `file reference` \<\< `var reference` — User message moves from a wall of pasted Lorem-ipsum-style text, to a file listing (prompt.txt, pasted1.txt, pasted2.txt, tools.json, skills.json, soul.md) with Assistant call `read(prompt.txt)`, to typed Python variables (prompt: str, pasted1: str, pasted2: str, tools: list[Tool], skills: list[Skill], soul: str) with Assistant call `print(prompt[:200])` [t=490:29]
  • Slide framing: instead of stuffing 38K frames / 100+ page docs into the context window, treat long context as an external environment the model writes Python against — RLM as a drop-in that dramatically simplifies long-form data tasks [t=494:05]
  • RLMs in the wild collage: Gabriel Lesperance — RLM \<3 GEPA, using RLMs to improve RLMs with GEPA; Mohammed Alshehri (@aizhang, Kraska, @aleinteraction) reimplementation — key insight: "treat long context as an external environment, not something to stuff into a context window"; applied to video understanding [t=494:05]
  • Agentica Opus 4.6 — new ARC-AGI-2 SotA 85.28% using an Agentica agent (\~350 lines) that writes and runs code; "A Data Scientist RLM That Lives in Your Program" — process DataFrames with RLMs and DSPy, drop-in for existing workflows on 100+ page documents [t=494:05]
  • Michael Siu / @michaelsiux — Monolith, RLM-as-a-service MCP tool, won TreeHacks 2026; AI coding agents lose context between sessions, RLMs let agents reason over arbitrarily large contexts with persistent memory across sessions; akira_xcore — "We built RLM for coding. And it f\*cking rocks. Swarm-native agents are here to stay." [t=494:05]
  • QC for Data Vendors (Sean Cai @SeanZCai, May 8 2026, 15.7K views): frontier-lab QC bar for RL data has become more load-bearing during procurement; contract non-renewals trace to vendors running zero categories of active testing, shipping without verifier FP/FN audits, no pass@k distributions across 3+ models, no contamination story [t=497:41]
  • Sean Cai cont.: as RL/SFT data markets formalize into 2027, banal synth data (garbage-in, QA aggressively, garbage-out) gets cut; small set of vendors with internal QC infra (mostly research-dense teams) pricing 2-3x commodity peers for nominally similar RL-data tasks [t=497:41]
  • Reward curves shown: prime/context-tools/context_efficiency_reward climbing 0 to 0.6 over 0-140 steps; top reward curve oscillating 0.2-1.0 — illustration of RL training dynamics on context-efficiency objective [t=497:41]

Full transcript (2120 words)

Uh hello everyone. My name is Jackman Ang. I'm a founding research engineer at uh Prime Intellect and today I'm going to be talking about reinforcement learning and recursive language models. So uh we've heard a lot about agents today and all the exciting things they do. Uh, and I find it quite crazy that just like two years ago, back in 2024 when cursor agent had just released, if an agent ran for more than 5 minutes, you would not expect it to be doing anything useful beyond that point. And yet here we are like in 2026, two years later, and where we just like let the agents roam free while we sleep, uh, going for hours and hours consuming millions and millions of tokens to do some pretty remarkable things.

And so I think it's not a question especially like in this audience that the models are really useful. And so the questions become more economical ones. Uh questions like can the models do my task reliably? Can the models do my task efficiently? And can the models do my task quickly enough that I can deliver the user experience that I want for my product. And so today I'm going to be making the case that the solution to all of the above is that you should be training your own uh language models and in particular you should be doing reinforcement learning to do so and also using RLMs. So first uh what is the issue with longunning agents?

So I believe that anyone who's used the agents be it like clawed code or codeex or any of the claws like you know that the models aren't actually that good at long context. Just because your model accepts 1 million tokens doesn't mean that it can reason across the 1 million tokens. And this is pretty apparent in the benchmarks. So if you look at any of the big model providers today usually in their model card they'll have a section called long context and there will be two benchmarks in there. The first one is MRCR. This is needle in the haststack. And basically this is uh testing the model's ability to retrieve a particular piece of information in a long series of text.

And you can see that as the context length gets longer, the models get significantly worse at this task. And people who have been working on agents kind of know that like uh this information retrieval thing is kind of nice to measure, but that's not really what we want to know about the models, right? We want the models to be able to reason across the 1 million context. And so a very popular benchmark that's appeared recently is graph walks. And graph walks is basically we pass a list of nodes and edges into the prompt and basically ask the model graph questions. So things like uh list all the parents of X or uh do a BFS on Y and list all the children.

And you can see it's the same story. As the context length gets longer, the models get significantly worse. But what if instead of passing the entire context into the context window, we just pass it a reference to the And I think this is pretty intuitive if you're like a data scientist or if you've done any amount of data science and you've done exploratory data analysis in Jupyter notebooks because like you don't pass your entire CSV into the Python code, right? Uh you usually do like okay I do my classic uh data science imports and then I define a data frame and then I'm doing these like code snippets to slowly manipulate my data frame try and figure out the structure of my data what the distribution is and then I figure out okay what things can I do with this data and if you think about designing agents in this way uh a lot of things become very easy like context chunking becomes very easy tool calls becomes very easy sub agent delegation becomes much much easier and the reason is that your orchestration agent now doesn't need to reproduce the context autorecursively uh correctly right it just can pass it as a variable and so uh why stop at just uh variables right um why not have the entire grabag of um programming structures uh so say for example you need to process uh you have task that needs to process 10,000 If you were to do this in like the legacy language models, basically you would need your orchestration agent to do 10,000 sequential tool calls correctly and like not just do the tool call correctly and pass the context correctly.

You also need to pray to like the summarization gods. Please please that when the model does its compaction that it somehow remembers all the various things it did and can somehow still remember uh like where it is even in like um calling all these sequential tool calls. But if you had just done it as a recursive language model, you could well the model could simply write a for loop and just basically do these LLM queries uh these sequential queries in a very simple manner. And so we see that like the people who are really good at using the agents kind of already do RLMs. Like if you meet anyone who's really good at using cloud code, they're always writing these prompts of like, "Oh, please please don't uh put the sub agent to uh uh uh don't put the sub aent output into your context window.

Don't put the tool code output into your window. You'll probably mess up and they'll garble your context uh and like write everything into a file because like people who are very good at using the agents kind of know that compaction doesn't really work and when you see this like you know it's over the the model is not going to recover from and so um any chat uh agent that you can use now like chat GBT uh and claude or like AI studio basically if you try to put a really long series of text into the chat window. Uh they basically always turn it into a file. So like the point being made here is that like people kind of already are doing recursive language models, but they're just not doing the full power of it.

They're only using the variable aspect of it. The fact that you can reference context, but they're not getting the full Python expressivity that you can get if you had a full Python ripple. And so I think it's no surprise that uh people have started to use RLMs for everything. So anything that needs like long context understanding. So there's RLM for videos, there's RLMs for gaming, there's RLM for coding, there's RLM for math. Uh I believe at some point on Twitter there was even an RLM for Epstein files. Uh I couldn't find the tweet. Uh maybe the CIA removed it Okay. And like the uh Alex Zang who is like the first author of RLMs uh he wrote this really nice uh post that I think everyone should read called the mismanaged geniuses hypothesis and the basic idea in there is that the models are already capable enough to do a lot of the tasks you want and the only thing holding them back is the scaffolding.

We don't quite know how to orchestrate these agents. We don't quite know like oh where we should put the memory what exactly it should be doing what are these like sub aent delegations and like the bitter lesson way of viewing this is like why are we making humans do this right we should just let the agents define their own scaffolds like all the scaffolds you guys use today cloud code open claw super vibe coded so it's very obvious that the models can already write really good scaffolds so they should just dynamically write the scaffolds as they're doing the inference Well, it's not so nice right now.

So, um you guys might have like uh seen the slides before and been like, "Oh my god, this is like the the best idea ever." And then like you go home and then you uh try out the RLM repo. Um but you might feel a bit lackluster. And the issue is that what you will notice if you look at the way that the agents do RLMs right now is the agents aren't trained on this scaffold. So, they're not very good RLMs. They don't quite figure out that, oh, they should be doing sub aent delegation. They don't quite know how to do this like context slicing thing but like yeah you should read the blog post but in the blog post basically it's showing this task where if you had just taken the base model with the base uh RLM prompt it doesn't perform very well but with a bit of prompt engineering you can get significant performance gains and you basically always beat the base model and if prompt engineering is enough for you to be able to beat the base model with RLMs what's stopping you from just training these good RLM strategies right into the model itself And so that's what we're trying to do at Prime Intellect.

So Prime Intellect, we're a platform that uh is trying to serve anyone who is trying to train and serve their own uh language models. Uh we support many of the open source language models from GBD OSS to Llama to Neotron and all the quens. Uh we basically have experiment management. So you can see your metrics and also all your experiment configs. And most importantly, you can look at rollouts which is like the most important thing. You can see your failure cases and uh look at your data. Uh we have some pretty interesting users. Uh so I think this was like two weeks ago. Uh Ramp Labs announced that they were working with us and they basically did a project where they trained a a small Quen model to beat Opus 4.6 on a retrieval task for Excel agents.

And not only did it beat Opus 4.6 ICS in terms of accuracy on this task they were interested in. They also could do it more cheaply and they could also do it at lower latency. Another interesting user segment for model training is data vendors. So there's this guy called Shan Chai. I think if you're in the data space in Silicon Valley, you've probably met him before. I think he's basically talked to like every data supplier, every data consumer in the valley. And he made this observation that the distinguishing factors of like which data labs will make it in the future is whether or not they're able to develop in-house training capabilities.

Because these models uh these um labs buying the data, they're not stupid, right? They know that like not all data is created equally. And before they sign like a million-doll deal to buy a bunch of data, they want to know like is this data going to improve my model capabilities or not? And a very easy way for you to do this and very definitive way for you to do this is to simply show reward curves. Simply show that if you have trained on my data, uh then your reward goes up or if you have trained on my data, your agent performs the task significantly more efficiently. So if any of this sounds very exciting to you, uh please check us out.

We're at primeintellect.ai. Uh we look forward to seeing what you guys build. And uh that's it for me. You guys have been a great audience. Thank you very much. Awesome. Thank you so much, Jackman. That was a really, really good talk. Um, next up we have Michelle Julia, who's a co-founder of Blue Labs, who'll be talking about for AI to be emotionally intelligent. Obviously, we've been talking about personalized AI for a while, so this is a topic that's pretty pertinent. But Michelle is also kind of a badass. She's one of the youngest patent holders from Apple. So if you've ever used Find My Find My iPhone or Bump to exchange contacts, the wireless system that runs underneath all of it, she is the patent holder for it.

Uh but today we're not talking about that. We're talking about emotionally intelligent AI. Without further ado, Hi. Hi to

Embedded Emotional Intelligence: Architectures for Long-Term Relationship-Aware AI Agents

Michelle Julia, Co-founder, Blue Labs (ex-Apple Find My patent holder)

Summary

  • Core claim: agents should treat stewarding an ongoing relationship and capturing immediate utility as co-equal objectives, not a trade-off — Blue Labs calls this 'embedded emotional intelligence.'
  • Framing anecdote: at Apple, technical specs mattered less than 10 years of dyadic vendor history when negotiating in Portugal.
  • Finding 1 (Nature Human Behavior 2025): LLMs cooperate well in self-interested games (prisoner's dilemma) but fail at coordination games (battle of the sexes), which is most commercial life; social chain-of-thought boosts cooperation exponentially.
  • Finding 2 (DeepMind bargaining game): Bayesian agents capture ~80% of max surplus by playing hardball, humans split fairly, frontier LLMs (Gemini 1.5 Pro, GPT-4o) are pathologically concessionary — accept every deal, give away surplus.
  • Finding 3 (ACL paper): user state at a moment (anxious in this room, meeting these people) predicts policy better than static personality trait, yet models treat behavior as static.
  • Open research: training LMs to modulate strategic registers (push vs pull), and architecting relationship representations (dyadic embeddings, reflective memory hierarchies).
  • First build is Blue JST, a 'joint state engine' with a dual-reward mechanism holding relationship-building and utility as co-equal; Blue Labs is hiring and collaborating.

Why this matters

Blue Labs makes the case that frontier LLMs are documented as pathologically concessionary in coordination games (DeepMind bargaining study: Bayesian agents capture ~80% surplus, humans split fairly, GPT-4o and Gemini 1.5 Pro give it all away), while most commercial life is coordination, not zero-sum. The state-vs-trait ACL finding (situational state predicts policy better than static personality) is a useful primitive for any persona or agent design treating users as static. For builders working on negotiation, vendor-management, or account-stewardship agents, dyadic memory and register-switching are open architectural problems worth tracking.

Slides (7)

Slide notes

  • Title slide: Michelle Julia, Co-Founder Blue Labs — 'For AI to be Emotionally Intelligent'; boardroom-of-old-men Getty image with speaker selfie superimposed (visual gag re: who gets to define trust and negotiation) [t=502:09]
  • Referenced research thread #1 — Nature Human Behavior 2025 cooperation paper: humans cooperate more with AI partners when the AI signals emotional/affective state, not just rational moves [t=502:09]
  • Bayesian Agent v. LLM v. Human (Qian et al — Google DeepMind): bargaining game where players trade colored chips with private valuations [t=505:21]
  • Bayesian agents played aggressive and got rejected a lot — captured \~80% of max possible surplus despite (or because of) the rejection rate [t=505:21]
  • Humans played in the middle, fairness-driven — neither maximally extractive nor maximally accommodating [t=505:21]
  • LLMs adopted a 'concessionary posture' — offered more than they asked for, got every deal accepted but left value on the table; pleasant, not strategic [t=505:21]
  • Implication: out-of-the-box LLMs are pathologically agreeable negotiators; trust ≠ capitulation; need state-aware reward shaping to recover Bayesian-style EV without the rejection tax [t=505:21]
  • Referenced research thread #2 — state-vs-trait ACL paper: distinguishing momentary emotional state from stable personality trait matters for modeling the counterparty in multi-turn negotiation [t=505:21]
  • Blue Labs JST (Joint State Tracking) dual-reward joint-state engine: tracks own state + inferred counterparty state simultaneously, optimizes a dual reward (task outcome + relational signal) to avoid both the Bayesian-rejection failure mode and the LLM-concessionary failure mode [t=505:21]
  • Closing slide: Blue Labs logo, contact michelle@withbluelabs.ai [t=508:33]

Full transcript (1633 words)

Y'all. I'm Michelle. I'm co-founder of Blue Labs. We're a research lab focused on emotional intelligence, specifically embedded emotional intelligence. Embedded emotional intelligence is the capacity to navigate ongoing relationships where each interaction shapes the trajectory of future wants. So it's not a static state. It treats stewarding the relationship and capturing immediate utility as co-equal objectives, not as a trade-off to optimize. So our research is around what architectures let AI systems do this in a way that humans do. If you take a quick step back, really we're focused on making AI sound and feel human, especially in commercial decision-making processes.

So that's where we're focused on today. Let me ground this in a quick story. So as he mentioned previously, as a grim mentioned, uh I was at Apple before Blue Labs and you know, I was one of the youngest patent holders there. If you've used Find My, it runs on wireless algorithms that I hold the patents for. And you can imagine I am a small Asian Oftentimes in negotiations, the room looked like this. So I was a little bit anxious going into every negotiation. Uh the first one that I went to, we were flown out to Portugal. And the night before I was sitting in the lobby of this hotel and I was very anxious and I was going through all the technical details of, you know, What exactly are we negotiating with these external vendors?

What is Apple's position? How do we talk to them about the tech? My manager then sat me down and said, "Listen, we have an hour to talk about this. Forget the technical details for a second. These these are the past 10 years of history that we've had with this vendor. And this is all the tea. Let me tell you about the relationship that this guy has with this guy and how we've negotiated with this guy in the past and what he kind of looks out for and how he has interacted with our big boss in the past. And this is all the dynamics of this room that you're walking into. And that will serve you way better than just memorizing technical spec.

It was at that moment that I came to a realization that what's important isn't necessarily just the technical utility of a conversation. In most settings, humans require an understanding of longitudinal relationships. And so for me to be an intelligent agent of apples, I need full diadeic context on each vendor and to be able to steward that relationship forward in a way that's beneficial in the long run. So that's a lot to hold for one person to hold, much less for an agent to hold. Most humans actually do this intuitively. You don't need to really think about the mechanisms that much. Most of y'all are, you know, well functioning and well situated.

But it's hard to model and balance relationship states over time in these utility based conversations and relationships. Mathematically, it is hard to prove. So I believe that the unlock of emotional intelligence in this realm is what will sincerely move us the needle for us in adopting AI as strategic and useful emulations of humans. We have built language models that are fluent in the work that humans do but not strategically competent. So taking advantage of these long-term relationship blocks, I believe true corporate function depends on highly nuanced abilities to balance trust and relationship with transaction and negotiation.

And so um I'm very excited about the space and my goal today is really just to give you like a little taste of what the field is, what is the state-of-the-art today, what are people talking about and what are some open questions. And if this is exciting to you too, we can talk more about what Blue Labs is doing later on. So we'll be going over social train of thought and game theory by modality and human behavior and state beats trait. I'll try and kind of touch very briefly on these. So the first one it was published in human nature nature human behavior last year. So basically they played this game right with AI agents where you have the prisonless dilemma which is a self-interested game and battle of the sexes which is a coordination game and their goal was to really see how models behave um in these specific states and what they found is an asymmetric result.

The models did pretty well at self-interested games. So they cooperate when you're supposed to cooperate and you know defect when defection pays but badly when it comes to coordination. And this is sticky because most human is a coordination game, right? When you sit in that hotel lobby in Portugal, our vendor is not trying to defect on us. We're not trying to defect on them. We both want a deal. It's just what kind of deal. So that nuance is hard to capture. Social chain of thought also actually increase cooperation rates. Um and so we see kind of an exponential growth when you're able to model both you and the The second piece is from Google DeepMind.

Um, it came out this year where they played a bargaining game with humans, Frontier models, and a uh Beijing agent, a custom agent that they trained. So, this was, I believe, Gemini 1.5 Pro and GPT40. And what they found is in these three camps of people playing the game where, you know, it's a bargaining game where you're trading chips, the Beijian agents are very aggressive. So they, you know, kind of play a hard ball. They get rejected a lot, but then they get 80% of the max surplus. So this is actually very good in a defined space. Humans are more fair. They give a little bit, they get a little bit.

They kind of want this balance. LLMs are very concessionary. So it's like, oh, I'll deal I'll do any trade with you and I'll actually give you more than what you're giving me just so that I can make this trade. So every deal is accepted. And we see an inability for these models to actually self-balance throughout the game. So the appropriate response here is really that as a human maybe when I first meet you I give a little bit so that we build the relationship and then when when it's coming to a very big transaction then I want to play more of a Beijing game. And so um this highlights that there is a static nature to uh in which agents kind of negotiate.

And the third piece, so this is interesting because it comes from computational psychology and not necessarily CS. Um but the highlights uh the findings are highlighted, you know, along a similar vein. So it's a paper accepted to ACL um on fixed psychological personas on states rather than traits. Basically researchers were asking how well do language models actually capture what who a user is. And what they found is who a user is at a specific time is more interesting and important than the user's like general state. So in this point of time given this relationship I'm a little bit anxious because I'm in a room of these kinds of people or I'm meeting these people for the first time.

These changes in states are actually more important to the users's policy than the underlying users like I am a naturally calm person or I am a naturally this kind of that person. Uh sorry personality trait. So what we found here is that the the static way in which we model personality actually leaves a whole lot of room for So what this means it shows that models can't coordinate over across changing conditions. They treat their own behavior as static and are naturally concessionary. Right? I'm pointing out all these problems to show you that there's so much more that we can do to imbue models with this sense of understanding and sense of emotional So, we have a couple of research directions and I have 30 seconds, so I'm going to run through these very quickly.

One, can we train language models to modulate between strategic registers? When to push and when to pull? Two, what's the most appropriate architectural representation of a relationship? Diadeic embeddings, reflective memory hierarchies. It's an open research topic. And any one of you, if you have an idea, you can you can implement these experiments pretty quickly and whip something up. And so this is a rough estimate of, you know, what we're exploring, we're beginning to explore at Blue Labs. Our first architectural attempt at this is Blue JST, a joint state engine whose core idea is a dual reward mechanism that holds relationship building and utility prioritization as co-equal objectives rather than reducing one to the other.

And like I said, it's open research. It's exciting. It's, you know, we don't have all the answers yet, but if any of this is interesting to you, we're hiring and we'd love to chat. We're actively collaborating across industry and academia and the research is out there for us to get. Thank you. Thank you, Michelle. Next up, we have Jackie Mock, who's the head of applied AI at RA. Now he will be talking about world models um and how do we move from language to physical intelligence um again we're moving into the terrains of physical AI embodied AI um not quite the robotic side yet but more world models world building sides of it um so once Jackie is set up we're going to be ready to &gt;&gt; All right.

Hi. I'm

From Language Models to Physical Intelligence and World Models

Jacky Mok, Head of Applied AI at Reka

Summary

  • Core claim: world models must predict the next action, not just the next token or next frame — Reka's path runs through VLMs into action-predicting world models.
  • Production today (surveillance, robotics) stacks classical CV for cheap detection/tracking with VLMs on top for reasoning and alerting; neither alone is sufficient, and the harness around them is what makes them useful.
  • Two paradigms to fuse: language/VLM models predicting the next token over visual embeddings, and diffusion video models predicting the next frame (now repurposed to generate zero-shot robot arm trajectories without prior demonstrations).
  • Core gap: VLMs are bad at physics — they hallucinate objects shrinking, disappearing at frame edges, or violating motion continuity, because they sample frames rather than process densely and reason about physics in text space.
  • VLM-judges-VLM evals compound the blind spots; Reka is releasing a synthetic physics-realism eval set to expose failures and let other labs train better embodied models.
  • Recap: VLM + diffusion + harness = path to physical AI, with evals as the gating instrument.

Why this matters

Reka's framing is that production vision deployments are CV + VLM + harness orchestration, not a single monolithic model, and that VLMs still hallucinate on physics largely because they sample frames and reason about motion in text space. The eval-set-as-released-artifact pattern (Reka is releasing a synthetic physics-realism benchmark) is a clean template for productizing diagnostic IP: ship the harness, not the model. The architectural pattern of cheap deterministic layer plus expensive reasoning layer plus orchestrating harness generalizes well beyond robotics.

Slides (10)

Slide notes

  • Three families framed by objective, not architecture: Language model describes reality, predicts next token P(w_t+1 \| w_t); Video model shows plausible futures, predicts next likely frame P(x_t+1 \| x_t); World model simulates intervention, predicts next state given an action P(s_t+1 \| s_t, a_t) [t=513:00]
  • Different objects, different objectives — the action term a_t is what separates a world model from a video model (counterfactual vs. continuation) [t=513:00]
  • Failure Mode 01: VLMs don't examine every frame — they interpolate [t=516:25]
  • Experiment A: random numbers sent as video -\> only 60-70% recall (some digits dropped/blanked) [t=516:25]
  • Experiment B: same random numbers sent as individual images -\> 100% recall [t=516:25]
  • Experiment C: consecutive numbers 1-6 sent as video -\> 100% (model interpolated the sequence rather than reading it) [t=516:25]
  • Conclusion: VLMs skip frames in video mode; consecutive/predictable numbers get interpolated, not read. Sending frames as discrete images restores 100% — implication for physics eval, where frames are not predictable [t=516:25]

Full transcript (1700 words)

Talking about uh how we go from language to physical intelligence. Uh my talk is about our path towards world models. So I'm Jackie. I work at REA where I am the head of applied AI. Um, REA is a multimodal AI for video and image and text. Uh, you may know us from some of the models we built uh a few years ago where we were like climbing the leaderboards. Um, we're very focused more lately on vision models and different modalities and uh at the lab we are working to understand how we can apply these to real world situations. So in terms of vision today um we are already um having a lot of these CV technologies that can do a lot of things right this is a solved problem being able to detect cars to detect things and to track items that's something that comes from computer vision um and we can use these to kind of help our deployments understand with more deterministic ways of what's going on within the video but you can see later on the video that the the machine doesn't actually understand like what it's actually seeing.

It might be able to see the heat map. It might be able to see the bounding boxes and this is where computer vision was before VLMs come in. So now we have VLMs and with VLMs we're able to look at a scene, think about the scene and then take action on the scene, right? We are able to do CV on top of it to kind of help it also figure things out over time. Um but this is kind of how we can apply LMS. Uh but we don't replace CV. CV is kind of on the side too. Um and another example of how we deploy AI in production is um here where you can add like detecting tracking and identification. Um here we still use CV as a very cheap step to kind of understand what's happening in the scene.

Uh then we use VLMs to do reasoning and then we use it to alert uh for specific use cases. Right? Neither alone is sufficient and neither alone is uh physical eye just yet, but these are the building blocks that we have that came from our language models. Um so BLM are able to predict the next token because we're able to take this visual space, encode it uh into some embedding and we generate the next token. So we can explain what's in a in an image and what's in a video over time. However, the output is still largely text based. Um there is this other paradigm that we also build models around.

Um where we're able to predict the next frame, right? So you've seen the fusion models where they generate images or or videos. Um this is also a path that now robots and physical AI is trying to use to uh uh trajectories for for robots. And these two models, the language model versus these video models are not exactly world models just yet. Um and for us uh we are taking we can go from both approaches right and both approaches actually help us craft this next idea of like what a world model is. So we want to predict the next action and that's the biggest thing that makes the difference between anything in anything out.

Um and we're going to talk about how we're trying to get there. Here's an example of how we're able to train a model from scratch. So this is not an off offtheshelf model. This is like completely from scratch diffusion model that was trained on video generation. So it can make cinematic films and cinematic scenes of 5 seconds. Um but when applied to robotics the main advantage now is that it's zeroot. So even though it in previous technologies you'd have to train a robot with previous robot arm techn robot arm movements um you have a diffusion model that is now tracing the trajectory of where the arm can go to achieve the objective.

Right? The biggest improvement is that this happens without the robot knowing what it was before and we're able to get pretty surprising results and there's many other labs that are kind of doing something similar to kind of control robots. Uh but where is the gap still? There's still a lot of things that we want to get better and and for a lab and when we build models the best way we do that is we understand what's broken and we create eval right. So actually VLMs are quite bad at physics. So one example is it will hallucinate. Uh an object might disappear. A object might get smaller some for some reason in the next generation.

Um it might not follow physics. Right? So one of the things that we're adding is we're adding uh an eval set to like kind of understand our blind spots uh for the other blind spots that we might have is that even though we evaluate a lot today, there's actually a lot of blind spots when we do where um even though the model is able to get the right output, it actually was sampled and we actually lost a little bit of data. Right? A lot of these models are also being judged by other BLMs. Uh so yeah, BLM's kind of judging each other to understand whether or not they're improving and this creates a gap as well.

Um so that's why um for us we are creating new data sets uh to kind of understand what the ground truth is. So all these things you see behind me uh are areas where the the models don't really understand right that's a ball game smaller. Um then you have like if something is falling is this falling proper properly when when two things hit each other what they do um and is the motion correct right does something spontaneously move and to be honest a lot of models are not able to predict this right now and this is one of these main like physics related graphs and um we create synthetic data to kind of understand what the realism chance is in our eval like even the best models today do not perform very well right and there's a reasons for that uh but I'll go over that now is that one of them is that BLMs do not look at every frame these large language model based approaches you know there's a lot of tokens that go into the to those these models and most of the times it needs to be sampled right so in our experiments we can kind of prove that like if you send it every frame it might understand but if you send it um a random amount of frames it's going to interpolate it's not going to understand what's actually going on.

So, that's one way it fails. Another way it fails is that when an object's just near the edge, um it's not able to actually see if the person disappeared or if they walked off the scene. And this creates a lot of um confusion because the model kind of assumes and predicted that the person disappeared even though they didn't see it per frame. Another area is that VLMs are really just going back to text. So it will reason about things in the text world. Um we have to give it more CV and like more like supplemental data for it to really understand uh what's happening within the scene. Uh it understands laws but understands it in a text space.

So it's able to more reason about it. Uh it goes back to like why our deployments today are actually more CV augumented where you have the vision model looking at the video but then also the the CV text explaining oh this scene has X identity and it's being tracked over many scenes and that's how we kind of help improve the VLM performance. So for us, we're using VLMs to kind of help improve how we judge physics. They're but ultimately they're still skipping frames today. Um we're using them to they're using them to match uh position, not motion. And uh they know physics just from what they learn from from the text based models, right?

And we're about to release some uh eval sets to kind of help other people improve their their models as well. Um so that they can also train the the the next embodied model. And for us um to recap for how we're going towards physical AI as a company um as we're building next model is we're still using our LM and our VLMs where we have the next token and that will be wrapped around a harness and that harness will help us control uh surveillance or it'll help us control robots. Um but we're also creating the path where we have the diffusion path where we have these video models that are now creating these control paths for robots.

Um and together they can be combined to create this kind of world model where we generate the next action. Um for the next step is this evaluation set because this evaluation set will help us understand if we're actually understanding what's going on or if we're actually flying blind. Um and yeah that that is our path that we take our language models to kind of evolve over time and now we're trying to um shift it to kind of help us support the next generation which is to uh build physical AI and world models. And that's that's my talk. Thank you. Thank you so much, Jackie. Next up, we have Gokul Shinasan.

He is the co-founder and president of Antim Labs. Now, he will be talking about simulation games and the future of robotics. And I think like he's got some really cool demos and videos as part of this. So, this is one to look out for.

Simulation, Games, and Faster Robotics Training Loops

Gokul Srinivasan (Co-founder, Antim Labs)

Summary

  • Gokul argues robotics is fragmenting across model architectures (world action models, VLMs, VAs, video action models, classical) and data-capture methods (teleop, internet video, synthetic, UMI), but the one common substrate across all of them is simulation — for synthetic data generation, digital twins, edge-case coverage, and policy prototyping.
  • The bottleneck is that building sims in Isaac Sim or Mojo is brutally manual: days to weeks per scene, even for experts.
  • Antim Labs' answer is Gizmo, a prompt-to-simulation tool that takes natural language or an image, spins up sub-agents, and produces a usable 3D sim in ~20 minutes, with APIs so Codex/Claude can spawn sims mid-workflow.
  • That unlocks closed-loop robot learning (e.g., 'train a quadruped to walk to point X' as a single agent instruction) and high-throughput sim generation.
  • The sim-to-real gap remains — contact physics, material properties, deformation modeling are still hard.
  • Part two: games as a training ground for high-level cognition (exploration under ambiguity, replanning after state degradation, decisions under partial information), all grounded in spatial-temporal memory.
  • They trained a 2B Qwen VLM as a computer-use agent on 400 hours of frame-action gameplay plus 60 hours of instruction fine-tuning, with reasoning-trace training in the pipeline, all running under a 200ms real-time loop.
  • Demo was early but functional.

Why this matters

Antim Labs' Gizmo collapses days-of-expert-work building Isaac Sim or Mojo scenes into a 20-minute prompt-to-sim flow with APIs that let other agents spawn sims mid-workflow, demonstrating that the moat in robotics R&D is the sub-agent orchestration pipeline, not the underlying model. Simulation is the one common substrate across every robotics paradigm (world-action models, VLAs, video-action, classical), so harness-style sim generation becomes a chokepoint. The sim-to-real gap (contact physics, deformation, materials) remains the hard ceiling for embodied AI.

Slides (23)

Slide notes

  • Slide 04/12 "Simulation is becoming infrastructure" — four pillars: (01) Synthetic Data: unlimited episodes without breaking hardware; (02) Digital Twins: customer demos before robot deployment; (03) Coverage: rare edge-cases on demand; (04) Evaluation: repeatable tests before deployment [t=524:39]
  • Framing: simulation is the substrate for robotics training — data, demos, edge-case coverage, and pre-deployment eval all run in sim before touching real hardware [t=524:39]
  • Slide 08/12 "Gizmo // Vibe-training inside simulation" — Gizmo API enables simulation at massive scale; end-to-end closed loop for robot training [t=528:22]
  • Prompt-to-sim demo: PROMPT 001 "Train the quadruped to walk to any object I specify in the scene." / PROMPT 002 "Train this robotic arm to pick up any objects on the table and place them in the box." [t=528:22]
  • Thesis: natural-language prompts compile into full training environments — the simulator itself becomes the agent's RL loop, not just a renderer [t=528:22]
  • Elden Ring gameplay footage rolling on-screen ("Use the Flask of Crimson Tears to restore health") — illustrates the 400h gameplay + 60h instruction corpus used to train a 2B Qwen-based VLM running a 200ms real-time control loop [t=532:06]
  • Architecture claim: games are the cheapest dense-supervision environment for embodied policies — VLM sees pixels + text, outputs actions inside the tight latency budget robotics demands [t=532:06]
  • Antim Labs positioning: "Simulation, Games, and the Future of Robotics" — robotics architecture today is fragmented across simulators, data pipelines, and policy stacks; Gizmo collapses the chain into one prompt-driven loop [t=528:22]

Full transcript (1644 words)

Good evening everyone. Uh my name is Gopal and I'm co-ounder labs and today I'll be speaking about um simulations games and how these are going to be really important themes um going forward in robotics. Okay. So since like the 1950s, 1960s, robotics has basically been in the cage. And what I mean by that is everything has been um pre-programmed. The environments have been fixed. Um the the scripting for like what what the robot supposed to do, everything has been fixed. So the environment has been purpose-built for the robot. And um of course to really unlock economic value, we can't have that where the environment is built for the robot.

The robot should work in existing environments. So um over the last 10 years 10-15 years a lot of work's gone in to make robots more and more general and um this has led to a lot of cool research. So one thing we see today is that even though there's been a lot of research the robotics community has no sort of answer as to okay what is the model architecture that's going to lead to a significant uh generality. So for example, if you if you're just looking at all of the latest research, we see world action models, u VLM, VAS, um video action models, and of course some people still employing classical algorithms.

Now, because there's different types of models, of course, we need different types of data collection methods. Um some of these are teley op just using internet scale video to train video action models uh synthetic data from simulations and also um UMI style uh capture. So these are all different types of data capture methods for um robot for training robots. So, one could ask now, okay, there's so many different types of models, so many types of data, like what's really going on? Is robotics just going to go in multiple different parts? And there's no um there's no real linking thread between all of them.

And I would like to argue that the one thing that's common among all of these methods is simulation. And what I mean by that is simulation is going to become an uh a part of the workflow both the R&amp;D workflow and the deployment workflow that you in all likelihood cannot escape. So um some of the places where simulation is going to be used are for generating synthetic data. Uh second is um you can create digital twins of environments and you want to make sure that they work in those digital twins before they you you know go out and deploy an actual physical robot. Uh the third one is for edge case coverage.

This is like really um well established and it's used quite heavily in things like autonomous driving and of course just to prototype policies before you deploy them. So, um, for all of these different, uh, you know, places where you can use simulations, um, even though it's it's going to be so ubiquitous, what what is the state of simulations is that it's really really hard to make them. Um, if I don't know how many of you have tried to build simulations or have used any of the simulation software like Isaac Sim or Mojo or something like that, but there's a real massive learning curve. And even once you've become an expert at it, it's still really hard.

And so what's on the slide right now is just the um workflow to create one asset and then place it. So you've so depending on how complex your scene is, you have to do this for multiple assets and um you know it just it's just really hard and it takes days and sometimes even weeks. So there's no reason for this to be the case. Um so with current agentic AI and a lot of the uh vision based models and language models, we can actually automate several steps of the pipeline or at least bring it as close to automated as possible. And so we built something called Gizmo. This is a prompt to simulation tool where basically you can give our system a prompt in either natural language or a just a picture and it will go out spin up a bunch of sub aents and it'll do whatever it needs to do and then at the end you just have a sim you have a fully built 3D simulation and this takes around like 20 minutes right now.

Um so you basically have your first pass of your environment done in about 20 minutes and let's say there's some human in the loop work required. It is still you know you can complete it in a couple of hours. Now this is contrasted to days or weeks. Uh that's what's being done right now. So I'm just going to play a demo of our Great. So that's the demo of the tool. So basically just prompt in something and then out you get a simulation. So um this un this unlocks some serious capability. So we're also going to have APIs. So what this means is that your codeex or open claw whatever you're using um in any part of the workflow it can just decide to spin up a simulation and uh you get a simulation out.

So this also enables massive scale. Right now it's just not possible to do simulations at really high scale because they're so hard to make. Um this also enables some really fun stuff like you you you can basically have an end toend closed loop closed loop for robot learning. For example, you can just say train a quadriped to walk to um a point in the scene that I specify or something and that is literally all the information that an agent needs to go out and actually do the entire thing and give you a policy for a trained quadriped. Okay, so uh is this it is is robotics solved? Well, of course not.

Um the simtorial gap still exists. What this means is that simulations um while they are useful, they're not 100% accurate yet. And this is fundamentally just a physics problem. Um there's problem with the contact physics and there's problems with you know we approximate the properties of materials and deformations are really hard to model and so um this is a hill that the robotics community and us are still climbing and we expect the gap to become lesser and lesser as the years go by. Okay, so we spoke about simulation. Let's go to games. Why are games important? So um in simulation you cannot only train manipulation or navigation or locomotion.

It turns out that if you're able to have a synthetic world you can train even highle cognition. And what do I mean by high level cognition is things like um exploration when when the goal is not clear. Um when you had a certain plan and then something happened in the world and your state degraded it. How do you recover? How do you replan? Um when you have when you don't have full information about the world, how how good is your decision making? So um all of these things are really important. They're not only important for robotics, they're also important for LLMs, but for robotics, they're they're specifically important because they it also needs to be grounded in spatial temporal memory.

So um I mean all of these things like exploration, um replplanning and uh you know long horizon planning, all of that needs to be grounded in spatial temporal memory. So we trained an agent and I'll just give you a very quick um overview about how over how we did it. So we trained a two billion um quen model VLM. So it's basically functioning as a computer use agent where it controls the keyboard and the mouse. So we pre-trained it on like 400 hours of uh um frame action video gameplay data. So that basically gives the model some instinct as to how to play video games uh with a pre-training and uh we did some instruction fine-tuning with around 60 hours of if data to basically steer the model through the game.

And finally uh this is something that we haven't done yet but uh it's in the works where you train the model to output reasoning traces and the reasoning traces then function as the instruction for the next step. Right? And finally, one thing we need to really keep in mind is since we we want to play a video game, we need real-time operation. So, um yeah, we need the model to take in the input, proc it, and decode the output within uh 200 milliseconds. So, now I'm just going to play a quick demo of um our agent. Um as will be clear to you, it is still early work and it's far from perfect, but um hope you'll enjoy.

Heat. Heat. And that's my time. Thank you. Thank you so much, Gokul. We're at 5:00 pm. We're in the home stretch. Very, very happy that everyone's still around listening to talks. Uh we're going to take a little detour and go into some aspects of design. And we're going to explore a different playbook now. So our next speaker is Weii Su from Lentil. And her angle for the talk is to explore the wisdom behind eastern philosophies and eastern product building. So this will be a very interesting talk at looking at design and AI but from a lens that is often not the center of discussion. So way whenever you're ready.

You want to go back? &gt;&gt; Yeah. Yeah. Hi everyone. Can you hear me? Okay, cool.

Eastern Product Building, Cultural Context, and Local Storytelling

Wei Wei Hsu (Wayi), Founder, GenZen

Summary

  • Wayi, founder of GenZen (AI video generation startup), reframes AI content creation through an eastern philosophical lens, arguing the dominant western Hollywood-led narrative is collapsing as the tools of storytelling become commoditized.
  • She contrasts western minimalism and singularity (Cash App, Meta's single ad-driven revenue stream) with eastern plurality and optionality (Alipay's superapp, Tencent's diversified revenue base), grounding the contrast in I Ching's premise that nothing is fixed and wisdom lies in navigating change rather than resisting it.
  • The Wandering Earth film is her anchor example of Chinese sci-fi succeeding on collectivist storytelling terms rather than mimicking Hollywood.
  • Her thesis: as synthetic content floods the feed, the real opportunity is not fighting slop but giving previously ignored communities (furry, traditional Chinese medicine, acupuncture, I Ching divination) the tools to produce and distribute their own narratives.
  • GenZen runs in-house agentic video workflows delivering ~10M monthly impressions across YouTube Shorts, Instagram, and TikTok for niche verticals, and has shipped a free I Ching Oracle app on the App Store as a productized demo.
  • The pitch closes on AI-generated content as a vehicle for a more vibrant, plural future where cultural authorship is decentralized.

Why this matters

GenZen's playbook (agentic video workflows productized into niche-vertical impression machines plus a free I Ching Oracle app on the App Store as funnel) is a clean template for turning AI capabilities into discoverable, downloadable artifacts rather than one-shot prompts. The Alipay-vs-Cash-App and Tencent-vs-Meta contrasts make the case that eastern plurality and revenue diversification are an underappreciated design pattern as synthetic content commoditizes the tools of storytelling. The opportunity isn't fighting slop, it's giving previously ignored communities the apparatus to author and distribute their own narratives.

Slides (67)

Slide notes

  • Title card: Wei Wei Hsu, Founder GEN GEN — 'A different playbook: the wisdom behind eastern product building' [t=536:08]
  • Speaker frames West vs East as the organizing contrast for the talk, gesturing one-finger 'west' then 'west vs east' on screen overlay ([gengen.now](http://gengen.now) watermark) [t=536:08]
  • Visual cue: split-screen showing Western pointing/gesture vs Eastern framing — sets up the cultural-context thesis behind product building [t=538:44]
  • Reference video clip plays alongside: a hand pressing acupressure/reflexology hand chart (labeled zones: Heart, Liver, Lung, Kidney, Spleen, Large Intestine Meridian, Triple Burner Meridian, Yintang, Digestive Health, Colon, Sciatic Nerve, Lower Back, Reproductive) — illustrating Eastern holistic/TCM mental model as product metaphor [t=541:20]
  • Implied contrasts cued by talk segment: Alipay (super-app, embedded life-OS) vs Cash App (single-purpose payments); Tencent vs Meta revenue-mix diagram (services/virtual goods vs ads); Wandering Earth as collectivist narrative example [t=541:20]
  • GenZen positioning: 10M monthly impressions cited as traction proof for Eastern-playbook content/product approach [t=541:20]
  • I Ching Oracle app screenshot referenced as concrete product instantiation — ancient divinatory framework wrapped in modern consumer AI UX [t=541:20]
  • Throughline: Eastern product building leans on meridians/holism/collective narrative rather than single-feature optimization — playbook for builders targeting Asian users or differentiating in Western markets [t=541:20]

Full transcript (1319 words)

Thank you for being here. Um, my name is Wayi and I run a startup called GenZen. We create AI videos to scale I'm going to be a little bit experimental today and I would like to spend some time today to talk about eastern philosophies as well as how this can shape the way we build in the future. This feels like something worth discussing because we're living in a time where Westerners are becoming Chinese on Tik Tok and are China maxing. So if you spend time on Tik Tok, you've definitely noticed this trend in the last few months. Not only so, the west is also paying increasingly more attention to both companies and AI models coming out of Asia.

A moment I want to highlight in this movie is um it is is this movie called Wandering Earth. How many of you have heard of this or watched this? Cool. And how many of you have heard of Three Body Problem? A lot more. Great. Um so wandering earth is also written by the same author Leo Sushing and it was a very important moment in sci-fi films because it was one of the first successful attempt for China to build a large-scale Hollywood sci-fi blockbuster rooted in Chinese story storytelling traditions. This is a story that's set in 2075 and the sun is expanding. Earth will soon become unlivable.

Instead of abandoning Earth, humanity decided to come together and build uh build 10,000 giant planetary engines on Earth's surface to push the planet out of the solar system. This plan would take 2500 years. And so in those in the next 25 centuries, they all agreed to live underground. Um watching this movie and seeing this collectivist mindset was very empowering for me. It helped me realize that we had been given one version of the story and one version of the future our whole life. Mostly created by Hollywood without really knowing about it. So for the longest time the western narrative has been in the center of how we build, how we live and also what we desire.

What happens if there's an eastern narrative in the center of 21st century? In the west, minimalism is typically favored. Apps tend to have one call to action on every page. For example, in America, you use Cash App or Venmo to send money to friends and to pay friends. This is what Cash App looks like. And on the other hand, this is Alip Pay from China. Not only can you send and receive money, you can also pay your bills, order delivery, or even take out a loan. So, in the east, vibrancy is often celebrated more. People want all the options. A lot of the times more is good rather than less is good.

This belief in the west also trends toward focusing on singularity. An example of this is for example a western company like Meta has been focusing on growing one revenue stream for more than a decade. So as you can see they rely on advertising. On the other hand, Tencent, the parent company of WeChatai, has been diversifying their revenue streams and they don't put all eggs in the same basket. When you put these two social media companies next to each other, the contrast is quite stark. And you can also see how that changes their behavior, how they deal with risk, and how they experiment in general.

While the east trends toward plurality, we also celebrate optionalities. So I couldn't help but wonder what led to this difference. Right? One observation is that the philosophy each culture embodies is very very different. While the west has Bible which is called shenanzing in Chinese which the holy scripture the east has something called eing the changing scripture. Its central argument is that nothing is fixed. Everything is in motion and a wise person doesn't really resist change. They seek guidance in navigating and also embracing it. With Eegene, practitioners tend to cast coins to generate six lines.

These are all the 64 options. Um the 64 hexogs. They offer guidance on life's changing situations. Over time, it becomes a cornerstone of Chinese philosophy, reflecting on ideas about balance, transformation, and also in and one of the changes that I think we're all living through in this era is content is synthetically generated. We're going to see more content that's synthetically generated than created by humans. And I think one question that we're all asking is will we be drowned by zero effort slop? Are we going to see slop puriferate and just flood everything? Right? How what do we do when there's so much noise?

Um but if we look at this from a different angle, the instrument of storytelling, so cameras, studios, distribution, the entire apparatus Hollywood built is collapsing into something anyone can hold. This also means communities that are ignored by Hollywood now hold the tools to create content and distribute them on their own terms. The stories that were too niche and also too foreign, too small of a market, too hard to cast, those are now producible by people who actually live them for the audiences who actually want them. For example, the furry community in China now makes content for themselves using AI.

And this furry animation gained 1 million views in the last two weeks alone. Another video creator in China has created an AI short film and it had gained 60 million views across all platforms just in the last seven days. Similarly, we at GenZen are helping clients make content in industries that are traditionally too niche. And this is really exciting to me because we're able to create broader access and also awareness to these verticals. In the last four months, we've delivered 10 million impressions monthly across YouTube shorts, Instagram, and also Tik For example, we've also made more egene content.

And to increase awareness around this, we built an app to enable everyone to get get a reading. Traditionally, doing egene reading can be a very complicated and perhaps confusing process for new beginners. So, this tool enables you to quickly ask your burning questions. And if you're interested, you could also try this tool out on App Store for free. Um, we made it for free this week just for you to try it. You can search Egene Oracle on App Store or scan this QR code. We've also created and scale content um around traditional Chinese med medicine, acupuncture, pressure points. These are also topics that are historically underlooked and these type of content are more easily created because of the tools we now have access to.

All of this is supported by our in-house agentic video video workflows and we streamline and optimize the content production process which in turn deliver impressions and also productive conversions to products. In a lot of ways, we see AI generated content as a vehicle to a more vibrant and also plural future where all of us hold the tools to create narratives we believe are important. With that, thank you so much for your time and you can find me online on Twitter at this ID. Um, if this is something that's interesting to you and if you would also like some of the stickers, um, please come find me afterwards.

Thank you so much. What a unique presentation. I need to figure out how to make slides and like presentations like these. So cool. Uh next up we have Anun Jooshi who's a tech lead for Bland. Uh and he'll be talking about voice AI. And we had a presentation from 11 Labs earlier but this one's going in a different direction which is voice AI is not a model issue. And we'll let Anun tell us more about this. &gt;&gt; Hello everyone.

Voice AI Failure Modes Beyond the Model Itself: Lessons from Scaling Bland to Millions of Calls

Anun Joshi (Bland, inventor of Pathways)

Summary

  • Core claim: voice AI agents survive enterprise traffic on unglamorous infra (VM detection, deployment discipline, tokenizer hacks), not on model quality — Bland now serves millions of calls monthly and credits one customer (American Way Health) with $430M/year in unlocked revenue.
  • Pain 1 — voicemail detection: most outbound calls hit voicemail or iOS/Google call screeners demanding 'say your name and reason'; Twilio's answering-machine-detection is a beep model and not enterprise-grade. Bland built a CNN over mel spectrograms and now publishes a public VM-detection accuracy benchmark.
  • Pain 2 — silent regressions: customers Slack 'why is my agent different today' with no code shipped. Bland's fix is canary deployments on dedicated per-tenant infra, versioned agent releases, percentage-based traffic routing, and phone-number-level A/B routing.
  • Pain 3 — tokenizer hallucinations on repeated digits: a Fortune 500 car-rental customer's ID capture broke because repeated digits collapse into a single token; prompt engineering failed, inserting commas between digits forced per-digit tokenization and fixed it (corroborated by a 2024 paper).
  • Meta-lesson: YC's 'move fast, break things' is the wrong default at enterprise voice scale — one-way-door vs two-way-door discipline matters more than velocity.

Why this matters

Bland's scaling lessons (millions of calls per month, one customer at $430M/year of unlocked revenue) say the voice AI moat lives outside the model: voicemail detection accuracy, canary deployments on per-tenant infrastructure, and tokenizer-level workarounds. The repeated-digit bug (digits collapsing into a single token, fixed by inserting commas between digits) is a free defect-rate win for any agent reading back account IDs, invoice numbers, or phone numbers. The meta-lesson: 'move fast, break things' is the wrong default at enterprise voice scale; one-way-door discipline matters more than velocity.

Slides (14)

Slide notes

  • Speaker intro: Employee #4 at Bland, 3 years building voice AI, millions of calls/month; thesis: "Voice AI is not a model issue" [t=545:45]
  • SCAR 02 / TRUST: Customers don't trust silent changes — canary deploy ramp 1% → 25% → 100% (canary → ramp → promote) [t=549:29]
  • Trust controls shown: pinned versions + one-click rollback; tagline "Customers can't trust you if they can't prove it wasn't you" [t=549:29]
  • Closing thesis slide: "The next great voice agent won't come from a better model. It'll come from whoever finally fixes the boring parts." [t=553:13]
  • Boring parts = infra/ops failure modes (VM detection, tokenizer bugs, canary deploys, AMD) — not LLM quality

Full transcript (1810 words)

Can you guys hear me? Good. Sweet. I hope you guys are feeling good. Um, I just want to say before we start, all the speakers have been amazing. So, can we just give a round of applause for all of them? So, also I actually changed my talk title because I did realize that voice AI does have model issues. So, I changed it up and I'm going to talk about some of the issues that I face while scaling up um voice AI for enterprise customers. Um so, I'm Anun. I actually grew up here in Singapore. I moved to San Francisco two years ago for Bland. And fun fact, I actually used to be a theater kid in junior college here.

Um yeah, I never thought I'll be back on stage again, but here I am. I do like storytelling a lot. Um, so I'll kick us off with one. So, two years ago, I was in San Francisco. I was just going on a coffee date with my CEO, Isaiah, and we were just hanging out, and he told me something that we still talk about to this day. Um, he sat me down, he looked me dead in the eye, dead pan, and he told me this. You're not going to believe me, but Pathways, the thing that you invented is going to impact millions of people, and millions of people are going to use it. And I looked at him, I'm like, "This guy's ridiculous." Like, he's he's just the typical founder.

He's trying to glaze me, make me feel good so that I work harder. Um, and at that point we were just Oh, well, one slide's missing, but I was going to show that we were just on Discord. It was me and another engineer. Um, we were talking about, we're just the FDEEs, we were the engineers, we were the product managers. Um, we're just figuring out the architecture for our agent with nameless and faceless people on bland Discord. Um, and it's crazy to think now that we're actually serving millions of calls every month. It still hasn't struck me that someone right now is talking to our agent. That's crazy.

Um, and I went on my Slack channel this morning too in team talk and there was a case study that came out with one of our customers named American Way Health and you can check it up on our website too. They said that we unlocked $430 million of revenue per year. I didn't know that was possible. I didn't know we could do that. Um so yeah, all of this has grown way bigger than I could ever imagine. Um I'm lucky to have lessons I've learned while doing all of this and some pain points too that I'd like you guys to learn from if you guys are trying to um integrate Voice AI into your services. So I'm sure all of you has seen a bunch of um voice AI like demos and they are super cool but the thing that's hard is productionizing it and actually making it work for enterprise customers.

Um so I'll go into some of the pain points and findings that I had for what we actually need to do to make voice AI work for enterprise use cases. Um, okay. The slides are different, but we'll just roll with it. Uh, I'll start off with the VO. The thing that I didn't realize that a lot of enterprise customers deal with and complain to us about is the voicemail detection accuracy. Um, I didn't realize that our current customers report and try to track the voicemail detection accuracy every single day. Um the reason for that is most outbound calls don't actually reach humans. Most of it goes to voicemail and ensuring that that's a robust system that works across various scenarios.

For example, with call screeners where now uh iOS and Google voice have checks right before the call connects. For example, they say um please say your name and reason for calling before we connect. Um, and there's a beep that happens. And what a lot of people use is Twilio, which has an answering machine detection feature, which essentially is just a beep detection model. And it doesn't work that well. Enterprise customers can't rely on it. So, I was working on that in bland to improve it. I was working on building a CNN model to look at the mel spectograms of every audio chunk. Um, and I didn't realize that beeps have so many different lengths and frequencies for phones and different phones.

Um, some frequencies also have dual band frequencies that are the same as what are called DTMF tones, which is what happens or the sound that you hear when you press digits on your phone during a call. Um, so you don't want to cause false false positives there as well. Um, so that was one of the hard things that we had to figure out and we even have a website now for you to test and benchmark voicemail detection. So if you guys are trying to um integrate voice AI into your systems, make sure that you're looking into how well their voicemail detection um accuracy or system works. So, next thing, um, there's going to be a slide of a Slack message that I received from a customer, and that Slack message said, "Why is my agent not working the same way?" or "Why is my agent not working the same way it did yesterday?" I don't know how many of you have dealt with customers telling you that or you yourself maybe have experienced that.

For example, I know with Claude, I hate it when things just change. Um, and from a business perspective, sometimes customers come to me when I didn't change anything. Like, I didn't push any new code and you're coming to me saying that I broke their system. Um, but I get it like you are spending hours working on their platform and their agent. And it sucks when something is just not working the way you expect. Um, one story of when I messed up as well is when I was trying to improve the hybrid search algorithm for our knowledgebased feature. Um, we have our own self-hosted vector databases and I was just trying to increase accuracy.

Um, it worked for some customers and it caused a regression for another and that sucks. It sucks to break the trust of your customers and that's hard to rebuild. What we've built and what I'm proud of building in bland was that we allow customers to deploy canary deployments and test out versioned agent releases. So for some context blind has dedicated infrastructure for each enterprise customer for data residency etc. And we with this we can allow them to spin up a separate container where they can test out a new agent release and send and roll out a percentage of traffic there, couple phone numbers to route there so that they have more assurance that any production changes are being tested out before it actually goes live.

So that's our way of trying to rebuild customer trust and that's super important for enterprise customers so that they can just focus on improving the agent the way it should work. Now this was this is a funny story too. So we're working with a Fortune 500 car rental company and we're trying to collect the car rental digit IDs. Um it yeah so that we can just help out with any other information they need to change and we went into production and we started realizing that okay the digits are actually different from uh what is actually there and we looked into our pipeline the transcription engine was correct the TTS was working the way it's supposed to the LM was the one that was hallucinating and the input for the digits were correct but it was saying and outputting something Um, I tried to prompt engineer my way out of it.

Didn't work. Um, and when I looked deeper into the tokenizer level, I saw that, okay, the repeated digits are actually being treated as one token rather than each digit being treated as a separate token. And that's just how the tokenizer is. Um, and the hack that actually fixed the issue completely was adding commas to each of the digits in between. The reason that worked was that the LM can then now treat each digit as a separate token and we actually found like later on that uh a paper was released uh which you can look up for sync and stro 2024 that was released after um we fix the issue but if you guys face something like that just know that you can look that up and adding commas will help um solve the issue.

It was only happening like five out of a thousand times. But if you're working with enterprise customers, that's five times too many. So this is a bit of a personal regret that I have um from there are decisions that I made when we were in early on seed stage and a lot of YC uh like advice is move fast, break fast, but I wish that I was a bit more intentional about some of the decisions I made when rolling out changes and yeah not causing as much customer pain. So, just being more conscious about one-way door decisions versus two-way door. Going back to that story with Isaiah, um he still teases me to this day about that that I didn't believe him.

And it is super empowering to know that you can have that much impact just from code. Um, and I just hope that you guys can learn from some of the lessons that I learned and the mistakes I made so that you guys can scale up for any other service or like integrating voice AI um to be even bigger than I could do. So, thank thank you so much for the time and yeah, my LinkedIn's here if you guys want to reach out. Thank you so much, Anon. Next up, we're going to look at this design. Uh, and we're going to be talking about going beyond flat design output and just going beyond autocomplete. So, how do we solve the complex design problems and enterprise design bottlenecks that come with AI?

And for that we're going to have Lin New who's the head of AI at Oello who will be sharing her thoughts on this uh once she's set up. It's it's really time consuming and costly to um create like marketing content uh on brand and at scales.

Beyond Flat Design Output: Editable Brand Systems and AI Creative Ops

Linh Nguyen (Obello)

Summary

  • Linh Nguyen demos Obello, an AI design platform pitched at marketing teams drowning in cross-format content demand (TikTok, Instagram, LinkedIn, billboards) where traditional design tools are slow, costly, and require specialist labor.
  • The core technical claim: generic image models suffer 'model collapse' when iteratively prompted to edit logos, headlines, or colors — Obello sidesteps this by producing fully editable vector-like outputs where elements can be moved, recolored, and swapped without re-prompting.
  • Each customer gets a private brand 'workspace' with a proprietarily trained model on their brand guidelines, signatures, and design-team assets — not internet scrape.
  • A brand asset ingestion flow pulls colors, logos, paddings, and tagged imagery from a URL, PDF, or Google Doc.
  • Key product surfaces: AI resize (semantic re-layout vs.
  • Canva-style stretch), AI Studio with people/subject/product modes for compositional generation (e.g. 'put this model in Rick Owens, generate across formats'), tag-based subject reference ('@product'), short-form video for dynamic banners, and an in-progress multi-resize that one-shots a designer's 10-size pack — collapsing a 1-2 week agency rearrangement task to one click.
  • The positioning: a harness around a proprietary brand corpus is the moat, not the underlying model.
  • Cited live client: a store at Funan Mall in Singapore.

Why this matters

Obello's defensibility is the per-tenant brand corpus and the editable-output harness wrapped around commodity image models, the cleanest live case study of how vertical AI productizes: customer-private knowledge plus editable artifacts, not generic prompts. The 'flat output vs editable output' framing and the named failure mode (model collapse on iterative edits to logos, headlines, colors) is a reusable narrative beat for any agentic workflow that needs structured intermediate state rather than raw model output. AI resize as semantic re-layout (versus Canva-style stretch) is a concrete demonstration of why composition discipline beats brute prompting.

Slides (115)

Slide notes

  • 9-grid of brand-system dashboards across enterprise clients (Reebok, Bobble, Tonum, Ritual, Allbirds, REI, Osea, Spanx, ebay) — each tile shows ingested brand: logo, type sample (Aa), color palette, photography library, video assets [t=557:54]
  • Framing: model-collapse on iterative edits — naive LLM image pipelines degrade brand fidelity over successive prompts; Obello's answer is a structured brand system that anchors every generation [t=557:54]
  • AI Studio prompt mode: free-text brief 'Create an ad for @marlo jacket and highlight the 2026 winter collect...' with @-mention to pull a specific brand asset (Marlo Jacket) from the library [t=561:18]
  • Style-reference slots: 'Add 1 to 3 images that capture the style you're aiming for' — multi-image style conditioning rather than text-only prompting [t=561:18]
  • Output controls inline: media-type toggle (image/video), brand/asset selector, aspect ratio (3:4), batch count (x1) — design-system constraints, not raw model knobs [t=561:18]
  • Funan Mall / Marlo Jacket client demo threads brand assets through to ad output [t=561:18]
  • Multi-resize one-shot: 'Sizes' panel auto-generates Social Media (1:1 1080, 4:5, 9:16, 16:9) plus display-ad set (Skyscraper 120x600, Wide Skyscraper 160x600, Half Page 300x600, Medium/Large Rectangle, Large Mobile Banner 320x100, 728x90 leaderboard) from a single creative [t=564:42]
  • 'Suggest Design' + 'Collection Maker (beta)' shortcuts — agentic layout assistance sits above the size matrix; one source asset fans out to every channel format with brand lockup ('obello') preserved [t=564:42]

Full transcript (1290 words)

So if you can see here as marketing channel multiply brands are faced with a relentless demand for content creation such as like when you want to create uh a marketing campaigns or advertisement uh across different format like Tik Tok, Facebook um Instagram and and so on or LinkedIn. Yeah. So we heard a lot of complaints uh and testimonial across CMO, head of design on of different companies big or small. They all have to admit that traditional design tools are slow, costly and reliant on specialized design skill. Not everyone can afford a big design or marketing team. So we introduced Oello. So it is an AI power design platform that enables team to instantly and cost effectively create onbrand content at scale.

So unlike like Canva where you can use rightway as individual but it won't be able to learn your brand signature your brand assets or brand voice. Boom. Boom. Yeah. So as you can see that like uh when we use AI generated image model or videos right we have like a problem of model collapse when you keep prompting it to say hey let's change this headline to another colors or change the logo or something like that. So when you use continuously use uh the previous uh generated AI image to fit to the next time when you prompt it it will lead to a model collapse. So we uh in Oberllo we have uh been able to turn those flat design into a fully editable where you can just move things around you can change the colors you can pair the colors that is learned from your brand So in here as you can see we have a lot of like workspace or domain and the models will be uh proprietary trained regarding uh to to their own brands guidelines signature and all that.

Yep. For example, Obert, I think that if you go to Funan Mo, you guys will see that there's a store over and they are one um one of our clients right now. Yeah, you can see that in here. Uh we um use a lot of like proprietary like uh training uh data from them and like design our design team. We train that uh and the model will be like fully you know kind of like private and not scraping from the just from the internet. Yeah. Uh so this is like one of the demo from our AI resize. If you ever try resize on canva you will understand that like sometime they will just like copy the element over and just stretch the whole canvas.

But in here you can see that it will smartly, you know, re reorganize all of these kind of elements around it. Yeah, you can see that. Um, so it not just, you know, copy over and stretch the canvas. Yeah. And when you replace a media with like another uh like videos or or image, it would change accordingly across all of the formats and and campaigns. Yeah. So that's what that's what uh that's how you do like marketing campaigns and advertisement a skill and on brand In here is how you use our AI studio uh functions that we have uh divided into you know people subject and product subject. Uh you can choose uh up to you know eight kind of images uh high quality and then you can just name them.

Let's say you will put it as Malo jacket or something like that. And then now you want to generate in an ad or a picture using this model. Let's say let she wear Rick Owen stuff. And we can you know at the same time generate uh to multiple format or size. All of this information has been you know uh intelligently uh saved in your brand domain. Let's say another example for train products and this is a design reference. So you have a design reference somewhere and you have your own train product and you want to you know kind of like just combine them together you can add tag like at and it will understand which uh subject you are referring to.

Yeah. Yeah. So this is the result from that. In here you can actually hit refine if you want to change any details of that and it will be like fully editable. You can actually change uh the text without you know kind of like prompting again. Um and you can actually you know open in in editor and do more of that. Yeah. So uh we also have like a short form videos which you can use to to be uh you know broadcast on billboard or any kind of like uh dynamic banners. Yeah. So that is all uh overall from our uh Oello platform. And here you can see that this one is a brand adset where you can just actually pull in um put in your URL or put in your PDF files, do um Google Docs or anything else and it will pull all of your color scheme, primary color colors, secondary colors, um logos, paddings and all that.

Yeah. And you can actually see that it will automatically tag your image like which kind which kind of product that it displays here. So we have like other work in progress which is not launched yet but you can take a look here. Um here is our monty monty resite. So you can check out like for example if a designer wants to create like 10 sizes at once. Uh he or she can actually just do an initial design with this with this kind of thing and then afterward uh they can just do like a collection marker or like the suggestion design like this. This is just like a very simple format but it can be a much more complicated layout that enable you to create uh more sop sophisticated campaigns.

Yep. You can see that we can select up to a lot of like different sizes and it will automatically you know kind of like expand like you like you see on Figma there's like an infinite canvas right so uh this is the result from the multi imagine like before if the agent uh the agency has uh to have to make like one week or two weeks to complete and you know rearrange all of is now we can do that with just a click. Yeah. So yeah, so for things that are loading it will it will load later. Uh which one is complete will display first. Um yeah, thank you for your attending. Yeah, that's my Awesome work.

Thank you so much, Lynn. Last two talks. Hang in there, guys. We're almost almost at the end of the first day of talks. To close the sessions out, we have two more talks. The first one is by Stefania Duga, who's a research scientist with Sakana AI. She will be talking about sovereign AI. So, how do you localize frontier models for certain countries? In this case, Japan since Sakana is based in Japan. Uh, I'll let uh Stefania set up and then uh we're good to go. &gt;&gt; Hello. Hello. Is the mic working? While I'm setting up, I know it's been a long day and you've been sitting and listening to so many talks.

So, I'm going to invite you to stand up for a second. Can you all stand up? We're going to do a breathing exercise. Take a breath in. Spread out. Okay, thank you for playing along. Awesome. Now we're ready to start. Uh, one second. Um, so good afternoon.

Sovereign AI: Local Model Ecosystems and Japan-Specific Deployment

Stefania Druga, Research Scientist at Sakana AI (Tokyo)

Summary

  • Druga reframes sovereign AI not as 'every country builds its own foundation model' but as local agency over global capability — owning selected layers of a stack: data, evaluation, adaptation (post-training/fine-tuning/RAG), routing, governance, plus a physical compute layer.
  • Pre-training is prohibitively expensive for most sovereigns; post-training is the practical sovereign control point.
  • Sakana demonstrates this with Sakana Chats (consumer product, 30k DAU in Japan, supporting standard Japanese, keigo polite register, and Osaka dialect) built on a post-trained Namazu model derived from open weights (DeepSeek, Llama, GPT-OSS) using Japan-specific neutrality and factuality benchmarks defined with a policy expert panel — Namazu outperforms base models on neutrality and answers sensitive questions (e.g., internet censorship) that base models refuse.
  • Their AI Scientist agent system (tree-search experimentation, agent-generated paper passed ICLR peer review, featured in Nature) frames scientific capability itself as a form of sovereignty.
  • Sakana Fugu uses a learned 'switchboard' orchestrator that routes tasks across a pool of frontier models — Japanese-context queries to the post-trained model, sensitive ones to on-prem or human review — and recursively calls itself for harder tasks; composition beats scale (Fugu beats every single ensemble member on LiveCodeBench and SWE-Pro).
  • Domain adaptation handles the 'missing dataset' of tacit expert knowledge via structured expert-feedback loops with MUFG and SMBC on credit memos, and government work on misinformation campaign analysis.
  • Their CTM (Continuous Thought Machine) research proposes a brain-inspired post-transformer architecture where reasoning emerges from neuron synchronization over time, with interpretable attention activations — questioning the dominant architecture is itself sovereign capability.
  • Closing thesis: no country owns every layer; sovereignty is the deliberate choice of which layers to own.

Why this matters

Druga reframes sovereign AI from 'every country builds its own foundation model' to 'local agency over global capability' by owning selected layers of a six-layer stack (data, evaluation, adaptation, routing, governance, physical compute). Pre-training is prohibitively expensive for most sovereigns; post-training is the practical control point, demonstrated by Sakana's Namazu model outperforming base models on Japan-specific neutrality and factuality benchmarks. The Fugu switchboard result (composition beats scale, with a learned orchestrator outperforming every single ensemble member on LiveCodeBench and SWE-Pro) and the MUFG/SMBC credit-memo pattern (1000+ expert feedback points loop) are templates for any domain where tacit knowledge is the missing dataset.

Slides (46)

Slide notes

  • Section marker: "1. Sakana Chat PostTraining" — segment opens the post-training case study for Japan-localized frontier model [t=569:32]
  • Research thread slide: "Coordination is a sovereign capability" — Sakana's bet is that the most capable AI systems are collections of specialized agents, not single scaled models [t=573:19]
  • Sakana research timeline: 2024 Evolutionary Model Merging (open models combined produce capabilities none had alone) → 2024-25 AI Scientist (coordinated agents run full research cycle) → 2025 ShinkaEvolve (evolutionary search over LLM-generated programs) → 2025 AB-MCTS (frontier models cooperate through tree search) → 2026 Trinity + Conductor (learned orchestration, ICLR 2026 → Sakana Fugu) [t=573:19]
  • Thesis line: "Each step strengthens the case for orchestration over scale" — Fugu is the productized switchboard endpoint of this research arc [t=573:19]
  • Global pattern slide: "Every country picks its own layers of the stack" — no single country tries to own every layer; sovereignty is a portfolio [t=577:05]
  • Sovereign-stack comparison table across 5 layers (Models / Data & eval / Compute / Governance / Deployment): Japan = post-train open, expert feedback, hybrid compute, METI + sectoral governance, Enterprise + B2C deployment [t=577:05]
  • Peer rows: Singapore/SEA = local voice / SLM, multilingual SEA data, cloud-first, IMDA + AISG, Govt + finance; Europe/France = foundation models, EU multilingual, sovereign cloud, AI Act, Public + industry; UAE/Saudi = Arabic LLMs, region-specific data, heavy investment, state-led, Govt + national; India = Indic open models, 22+ languages, public mission compute, MeitY framework, digital public goods [t=577:05]
  • Closing frame: "Same stack. Different ownership decisions. That is what sovereignty looks like in practice" — sovereignty defined as layer-by-layer ownership choices, not full-stack autarky [t=577:05]

Full transcript (1879 words)

My name is Stefania Dugga. I'm a research scientist at Sakan AI in Tokyo. And um, today I'm going to talk to you about sovereign AI. Um and what I mean by that is not necessarily uh any country building a local model um but more the ability to talk about local agency over global capability and think about that. So in practice when I think about sovereign AI I think it's important to consider three things. Um the data which data needs to be stay u local and what models are best adapted for local use. Compute and evaluation uh what sort of compute resources we need what workflows run on premise which workflows run on the cloud and accountability determining who remains accountable when we introduce AI systems into our institutions.

So I wanted to share with you a personal uh story of how I got interested in this topic. Um I come from a small village in Transennylvania, Romania. And uh before working in AI research, I used to run AI literacy workshops for children, families and educators around the world, including here in Singapore. This is a video from uh academia hackathons with kids in 2013. And what I learned in this workshops, in classrooms, in maker spaces, in libraries is that people are very interested in AI. They want to use it, but very often the AI models and systems are not adapted to their language and local needs.

And that translates into frontier AI capabilities of today. We expect communities and people to adapt to AI systems instead of adapting the systems to the local needs. And in Japan, this localization poses multiple challenges. We need to consider different registers for the languages, different cultural norms, different workflows, scientific practices, safety and security policies. So localization challenges is institutional and and sovereignty uh I want you to think of it as a stack right so it starts with data and figuring out what sort of unique data we need. Um then it goes to evaluation. How do we check for neutrality, factuality, specific country benchmarks?

Um, then we're talking about adaptation and this primarily happens through post- training, fine-tuning, rad tool use. Then we have the routing layer and here we need to have policy aware model selection interaction. What are our users? What are the different personas? What's it what are the different UX decisions in how we present these models and products to to the users and governance. So beyond that there's also a physical layer right because the different aspects of the stack have different needs for pre-training um we need a lot of data a lot of compute uh and the cost is prohibitive in most of the cases for post- training we need to care a lot about local norms and preferences and I wanted to show you some examples of how specifically we consider that in some of our projects and products.

So uh last uh um month in March 24 we launched our first consumer product Sakana chats and in this consumer product um we it's available for free for people in Japan. Uh it's equipped with web search but we're actually and it's available for anyone in Japan. We're actually supporting multiple ways of interaction. So we're supporting uh standard mode uh which is neutral default Japanese register but we're supporting also keo the polite mode which is more form used in formal context and we're supporting dialects the Osaka mode which is actually um giving answers in the kai dialect and people really appreciated this we have over 30,000 active users every day and in this particular project we use post training as a sovereign control point so we started with open frontier models such as deep sea, llama, GPTOSs.

Then we had unique uh Japan data for evaluation and preferences and we define a series of neutrality metrics with an expert of with a panel of experts in policy. Then we used this to post-rain this open weights models to create a model we call Namazoo. And we evaluated um we compared the evaluation between the postrain model and the base models. And we showed that the postrain model outperformed the original models on neutrality and factuality accuracy. But the fact that it outperformed them is not the only thing that matters. What we also showed is that many of these existing models would just refuse to answer uh questions that are more sensitive.

For example, if you would ask deepseek, please tell me about government inter uh internet censorship in various countries, it would either refuse to answer or give a generic highle uh answer. Uh after our post training we showed that Namazu actually gives a multifaceted um response with links to specific uh news articles that um are trusted. The second model I want uh uh project I wanted to show was our work on AI scientist that is focused on scientific capability as a form of sovereignty. So uh in this project we're actually using a multitude of agents that are supporting the entire research workflow.

So um the agents start with idea generation, novelty checks, idea scoring, then uh um we're using treebased experimentation to test these different ideas, generate the code for them, do ablation studies and at the end we are actually creating a full paper um presenting the results. And this work was uh la uh featured in nature last month as well. Um this is how the AI scientific scientist is using tree search to process like different hypothesis and test them and then pick the best candidate. And the paper uh generated with this uh system uh is the first uh fully generated paper that passed uh peer review at iclair last year.

Um the other example I wanted to show you is how we use multi- aent coordination. So for this a very important concept is the concept of switchboard. Um and this switchboard learns to automatically route tasks depending on how hard these tasks are to the most appropriate models. And like this we're optimizing for the cost and also for security. And routing can be seen as a form of sovereignty um not as a way of isolating specific solutions from global solutions. So if a request is um very relevant to the Japanese context, it's going to be sent to the Japanese postrain model. If your a request is very sensitive, maybe it's routed to the on premises secure model or maybe a human review is being uh solicited.

So this idea of coordination as sovereign capability is not only an architecture for focus for us but also a research focus. Uh we believe like our bet is that the most capable AI systems are collection of specialized agents and not single scaled models. And what we showed was actually in this model that we just launched the Sakana Fugu is that we can train a learned orchestrator to pick the best model um given a specific task. But this orchestrator can also learn to call itself recursively for harder tasks. And this work uh is now available in beta access and was featured in two papers presented at iclair this year.

In the evaluations of fugu, what we see is that composition beats scale, right? So um we compared fugu which coordinates a pool of frontier models as an ensemble. It per outperforms any single member of this ensemble on uh live codebench and sweep pro and other evaluation benchmarks. Uh next I wanted to talk to you about domain adaptation because we all know data is scarce and there's a lot of data that we currently don't have digitized like there's a lot of tacid knowledge and this is the missing data set. So when we're working with different institutions, banks, hospitals like healthcare, government um we need to have a process for integrating expert critique and feedback back into the model and the tools that we're developing.

So for example uh when we work with some of the major banks in Japan like MUFG and SNBC for credit memos we solit solicit over a thousand points of feedback that get fed back into the model that uh learns to create better credit memos for their expert Last but not least we're also supporting the government in Japan. So our team has uh showed that they can use an AIdriven intelligence for analyzing social media and show exactly how campaigns of misinformations are being started and And maybe the most important form of sovereign AI is to maintain local capabilities of questioning the dominant architecture.

So in our CTM work continuous thought machine we actually the team is actually proposing a new architecture beyond the transformer and this architecture is inspired by the brain where the reasoning emerges from synchronization between neurons over time. So instead of h having a single pass attention um there are multiple attention heads um that are um coordinating and such the model learns how to do pretty complicated tasks like solving a maze and the way it learns to do that it's also inter interpretable for humans because they can see the activations at the bottom. Um, we also tested it on image classifications where we could actually see exactly what part of the image the attention heads focus over time.

And the computing is actually adapted like for simpler images, it spends less time to figure out the classification than for complicated images. So those were only a few of the examples of the work that we're doing at Sakanam. Most of the projects I shared today are open source. They're on our GitHub and on our blog. Um we want to develop AI solutions for Japan needs and democratize AI in Japan and I share with you this this stack layer for the sovereignty right but every country picks which layers of this stack they want to own and they can own so not no single country tries to own every single layer of this stack so it's important to see how different countries make different ownership decisions and this is what sovereignty looks like in practice this.

And to close it off, I wanted to to leave you with this message from kids to parents to researchers to AI engineers. Um, it's very important to realize that we all have agency and that local agency is more important than global capability. Um, so thank you very much. Thank you so much, Stefania. And for the last talk of the day, we couldn't think of anyone better to have than Swix himself. Um, Swix is with Cognition, but he also happens to be the founder of the AI engineer conferences worldwide. And since this is our first edition in Singapore and Swix is from Singapore, it makes perfect sense to have him close day one of talks today for all of us.

So Swix, whenever you're ready, the floor is yours. &gt;&gt; Okay. Can you hear me? Uh I think I think they're switching the lapel mics on. Uh shift. Where is this? Should be okay. It's okay. I don't need to. Yeah, we're &gt;&gt; Okay. Hi everyone. Uh,

The Agent Lab Nation: Harnesses, Evals, and Enterprise Deployment

Sean 'swyx' Wang (Founder, AI Engineer; Advisor, Cognition; Singaporean)

Summary

  • swyx closes Day 1 wearing three hats (AIE founder, Cognition advisor, Singaporean) to deliver the canonical case for the agent-lab thesis.
  • AIE now serves 1.5M unique devs/month with 9,000+ on the Day 1 livestream. He maps the arc: 2024 = AI coding tools, 2025 = product, 2026 = year-of-deployment / decade-of-agents (per Karpathy).
  • Cognition made three non-obvious bets in 2023 he wishes he had made on his own 'smol developer' agent: choose code (code agents eat software faster than every other demo), barbell sync/async (the 'semi-async valley of death' kills middle-ground UX), and enterprise-first (the hardest eval that exists).
  • He contrasts model labs vs agent labs in compute-allocation, but flags convergence: OpenAI/Anthropic hire FDEs while Cursor/Cognition train their own models (SWE-grep, 3.5-class).
  • The 'just put it in a container' meme is dead wrong — agent platforms need stateful sessions, machine semantics, isolation boundaries, scoped identity, least-privilege permissioning; he cites real incidents where parallel agents shared caches and auto-accept mode published a whole company's source to a personal GitHub due to comingled secrets.
  • On evals: SWE-bench is marketing; Cognition runs a multi-dimensional internal eval suite, and 'enterprises are the hardest RL environment that exists' — eval quality correlates with publicly-disclosed ARR growth.
  • Devon's enterprise wedge in APAC: tens-of-millions/year in LLM-token spend per customer, banks running loans on spreadsheets owned by rotating analysts, 2M lines of undocumented COBOL with no owners, $100M AI budgets, 600-developer rollouts.
  • Solutions shipped: Devon Playbooks (structured templates parallelizing agents, 'a single playbook can be worth hundreds of millions'), DeepWiki (codebase comprehension, doc-generation-then-migration), and custom SSO as the gating sales requirement.
  • Closer: Cognition picked Singapore as its Asia HQ (acquired Havana, Nathan in the audience). Singapore has 4x demand-vs-supply for AI engineering talent, 40% YoY role growth. swyx 'gives up on the government' — calls for private-sector-led, citizen-driven push to make SG the 'agentic nation,' echoing FDM Abishek's deployment/democratization/decentralization framing.

Why this matters

swyx's closer is the canonical statement of the agent-lab thesis for 2026: the era of just-wrap-a-model is over, and the durable platform sits above the container as stateful sessions, scoped identity, least-privilege permissioning, isolation boundaries, multi-dimensional evals, and structured playbooks. Cognition's three non-obvious 2023 bets (choose code, barbell sync/async, enterprise-first) are now load-bearing for the whole agent-lab category, with model labs and agent labs converging from opposite ends (OpenAI/Anthropic hire FDEs, Cursor/Cognition train their own models). The APAC enterprise pain he names is the deployment frontier: banks running loans on rotating-analyst spreadsheets, 2M lines of undocumented COBOL with no owners, $100M AI budgets, 600-developer rollouts, custom SSO as table stakes. The 'enterprises are the hardest RL environment that exists' line is the strategic frame for anyone selling harness or eval IP into regulated industries, and the call to make Singapore an 'agentic nation' through private-sector-led, citizen-driven push (echoing the deployment/democratization/decentralization framing) names the lane before larger consultancies wake up.

Slides (35)

Slide notes

  • Title slide: "The Agent Lab Nation" — swyx, Advisor, Cognition; framed as the anchor talk at AI Engineer Singapore (AI Engineer Europe 2026 / Building AI Engineer with AI deck) [t=585:01]
  • "Sync \<-\> Async" slide with hand-drawn diagram "The Semi-Async Valley of Death": x-axis Agent Autonomy (100ms / 10s / 10m / 1h / 10h), y-axis Experienced/Observed Productivity. Left peak = Devin Desktop / Devin Terminal (Deep Work, Focus, Hardest Problems — "make faster for flow and interactivity holding quality constant"). Right peak = Devin Cloud (Parallel, Background, Commodified — "trade time for higher intelligence"). Trough = /handoff zone, "Not enough to delegate, Not fun to wait" [t=585:01]
  • "Cognition: The Devin is in the Details" — Perception ("Just A GPT Wrapper") vs Reality slide. Reality panel: "Frontier Model Diversity has gone up over the last 2 years" — top models on OpenRouter increasingly more diverse; LLM API market share in Enterprise also more diverse, esp in Coding (stacked-area + pie charts). Surrounding tiles show Cognition's own model/infra work: "Introducing SWE-grep and SWE-grep-mini: RL for Multi-Turn, Fast Context Retrieval"; "Introducing SWE 1.6: Improving Model UX"; Ray Summit 2025 keynote "Building Cursor Composer with Sasha Rush" — Composer improves with training chart [t=591:04]
  • "More Anonymized Demand across APAC" — concrete enterprise pipeline numbers swyx is seeing from the Singapore/APAC vantage: Big 3 Local Bank doing COBOL migration of \~2M lines on mainframe, no documentation and no remaining SMEs, board-level flagship; Global bank SG HQ with \$100M budget for engineering productivity; \>600 developers in a single rollout for ONE business unit; Government agency with \~2,000 apps incl. national housing portal (4M lines of COBOL) and legacy military meteorological apps in VB6/.NET WinForms; Regional superapp running a 12M-line monolith Java-\>Kotlin migration; Large startup with \~4,000 bug reports/week; Asian Telco \$50M/month revenue settlement system — not enough developers and 6-9 month backlog. Thesis: Singapore/APAC is the Agent Lab Nation — concentrated demand, COBOL/legacy depth, willingness to spend, and acquisitions like Cognition's Havana / Asia HQ build position the region as the proving ground for agent labs vs model labs [t=597:06]
  • Closing slide: AI Engineer Singapore "Thank You & See You Tomorrow" over SG skyline [t=604:40]

Full transcript (4872 words)

How you all doing so far? Enjoying the conference. Yes. Awesome. So glad to have you. Um, if you don't know, I'm Sean or also known as Swix. I come here in three capacities. One, I'm the founder of AI engineer. Uh, two, I am an adviser to cognition and one of the leading agent labs, and I'll explain what that is. And three, I'm here as a Singaporean. And I think all three of these identities merge together in this one talk, which I really wanted to share with you as well. Um, so let's go into it, right? Um, I don't think this clicker is working at all. All right, I'm gonna gonna skip the clicker.

Uh, so first I'm going to tell talk a little bit about our story as a conference. Um, I'm pleased to say, you know, like uh we've been uh this this conference is three years old. Uh it's going it's gone around the world from London to Paris to San Francisco to New York to Miami uh and now to Singapore and next to Melbourne. Uh we've been growing quite a bit. Uh we now serve 1.5 million unique developers a month. Um and uh over 9,000 people have seen today's live stream alongside of you in person as well. Uh we are really trying to do uh our best to grow developer community all around the world and serve the the going AI engineering industry.

Um but particularly the Singapore, you know, I've always been a Singapore show. I' born and raised here. I went to uh I left for college uh in the US, but uh I keep continually being a very vocal and public um advocate for Singapore uh especially uh for for fellow Singaporeans but also other people try to visit Singapore for the first time and I'm I'm actually pleased that we brought like Stefania and like a lot of my sort of international friends to visit Singapore for the first time. Uh and in fact I one of my launching pads for my own career was in Singapore. Uh I spoke at GSCOM Asia, still one of my favorite talks I've given of all time.

Um and that really start gave me the potential for like what conferences can do for uh not only my own career but also galvanizing an industry, galvanizing a country uh together as well. Um I did also organize a lot of Singapore meetups, so I'm like kind of not new to this. Um here's some of our friends including Lihao and Thor and Thomas. Uh some of you who have seen who were familiar faces in the sort of engineering and conference circuit as well. Um more recently about 3 four years ago I moved to San Francisco um and started Leighton Space. Um hands up I don't know if anyone has heard of latent space my podcast and yeah okay thanks so much for listening.

Um and as part of that I had a realization that there would be this thing called the AI engineer. Um and I started I wrote this like infamous line that I'm going to live down for the rest of my life. Uh where basically there's this sort of forming gap between the research engineers and the full stack engineers um and that is effectively what all of you are today is AI engineering I think it's a hugely growing uh demand if you don't know if you came to this conference and you've not read the blog post you probably should uh read what the the definition of AI engineer is um be just just around at the same time I actually started hacking on my own stuff I'm not just a content creator I'm not just a community person.

Uh I'm also a builder. Um I'm just not a very good one and I'll be super honest about that. Uh so I started building my own coding agent. It got super popular. It's called small developer. Um and it was built built on claude one if you can imagine. Uh three major versions of claude a go. I was building on this thing. Um and I got very excited about it but ultimately couldn't really scale. And also the model weights degraded on me overnight. Uh which I I know it's a conspiracy theory but I swear mine was true that uh the model got dumber overnight. Um so I stopped building it but uh over lo the whole like I I moved on to sort of greater and better things.

So the very very first AI engineer I declared that there would be three types of AI engineer um and I didn't you know I started sort of broadening out and in fact that was probably a career mistake. Uh what actually happened in the subsequent three years is exactly this sequence uh where 2024 we build out more um sort of AI coding tools, 2025 more product stuff. Uh 2026 is definitely the sort of year of deployment of agents. Um and yeah this sort of kap Andre who's a little bit of a mentor of mine um drove this sort of stake in the in the ground when he last year he said that this is the start of the decade of agents right if you take the founding of open AI in 2015 as the starting point um uh uh and taking the ne the first 10 years of scaling well what happens in the subsequent 10 years is probably the deployment uh and building out the the harnesses and the scaffolding uh that that becomes agents Um and that's really kind of the path that led me to cognition.

Um uh they made three choices which I wish I made when I did small developer and I wrote about the AI engineer in 2023. Uh the the the three non-obvious choices were choosing code um bridging sync and async and focusing on enterprise. I think each of these things were not like sounds super obvious now. In 2023 you wanted to build chatbt you wanted to go consumer. Uh in 2023 you probably wanted to do auto reggressive uh LLMs and not really think about sync synchronous sync agents um and and and code was one modality out of many modalities um but I think uh you know uh it business has shown that it is the king modality so choosing code um I think this is something I wrote in my blog post on on on cognition where I really talks about like code is a proxy for software like coding agents and if the and basically If software is eating the world, then code agents are eating software and it it actually starts to accumulate a lot of the power and the the economic value and it can probably do it in a much shorter time frame than all the other agent demos that you've seen that are probably not working as well.

The second part is something I've written about in uh in this other blog post called the semi- async value of death. There basically there is no middle ground. You either want your responses very very quickly uh or you want to delegate asynchronously. And I think um there's this sort of uncanny valley effect that happens when uh responses or LMS are like fast but not fast enough where you're sort of waiting there on the phone whether it's voice or code or whatever other modes of interaction. So you basically just want a barbell approach of um uh having the most synchronous live uh experiences or the most async experiences.

And I think any company that can adequately straddle the two uh is going to do super well. And uh finally, enterprises. Um I think this is something that um abstractly makes sense. Like obviously you want to go after like the big logos like the City Banks and the OCBC's and the Goldman Sachs. Um but I think it I didn't really appreciate why. So I'm going to spend a little bit more time in sort of double clicking on this just so you understand really what being enterprise focused means. Um and enterprise focus I I think in very plain terms serving serious customers. A lot of AI customers are non-serious.

Like they will try your tool and then they will not give you feedback. They'll try your tool, they'll turn after three months for the hot new thing. Um, enterprise is the most serious vetting you'll ever get. Um, what do you what what so what does that mean? Um, a lot of tools start out single player. Enterprise is immediately multiplayer to the tune of tens of thousands of developers, tens of thousands of repos. Uh, the pricing power is also very interesting. Uh instead of seeking instead of starting with like a standard $20 a month plan and seeking the maximum subsidy and getting pissed off whenever people remove the subsidy and then moving on to the next best subsidy.

Um people are willing to pay for outcomes because this is enterprise we're talking about. Um and also but but to me the most interesting thing is being the first to discover expensive problems. Um and that's probably only discoverable at the uh enterprise scale. Um so this is the sort of standard cognition site. I'm going to show you my version of it which hopefully uh is more memorable. Uh all in all I call this the Devon is in the details which is kind of a nice pun. Um and this is the the the subject of part two of the talk right like I'm not here to talk about cognition. I'm here to talk about what I learned from cognition in case you guys end up building an agent lab or working at an agent lab because I think it is probably the most single valuable lesson for any AI engineer.

Um so for for reference I wrote this up in a post called the agent labs thesis. Um this is the November AI engineer summit that we did in New York. Uh where we listed agent labs on one side and model labs on the other. You can look at those conferences on YouTube. Uh if you want to see what examples of agent labs versus model labs look like. Um but if you want it in one chart, this is probably it. Um where model labs uh proportionately allocate resources towards training and compute uh and less towards uh deployments. Obviously that deployment has gone up over time. Um, agent labs are more or less the complete opposite in terms of the resource allocation and prioritization, right?

Um, and I think this is mostly holding except that they are starting to encroach on each other's turf. Like when I wrote this it was like more clear now model labs are building agent uh labs internally uh with open eye and topic also doing uh hiring FDAs and then agent labs are building models internally as well with both cursor and cognition uh rling like like putting a lot of compute in rling their models. Um if you want to sort of break it down in terms of like uh contrast you can also do that in this way but I'm going to skip over this uh for the sake of time. Um, and I think like the the details is really what I really want to sweat, right?

Okay. So, for example, uh, a lot of people will say like just put your favorite coding agent of choice. I don't want to name any uh, ones to to not piss them off. Uh, just put it in a container. Um, the the reality is that it is not just about the container format. Uh, it's also about just building stateful sessions. Um, and these are all the problems that have historically come up, right? Uh, it is about giving it real machine semantics, about giving all the tooling for real computer use. Um and here's a fun example of a real life situation where shared machines uh if you want sort of multi-tenency for your sessions stateful sessions for coding agents it will actually break right so this is a real incident um that these are real incidents with the same root cause right um so real incidents for example parallel agent sessions interfering with each other because they have a shared cache um or agent and auto except mode published the entire company's source code onto a personal GitHub because why they had they had the secrets uh comingled right Um so what they both share is that basically you have no isolation boundary in a container like container just only knows about one thing uh but it doesn't really uh it's not really set up for crossing or changing context between agent sessions.

Um so basically what you end up building is an agent platform which is everything above just the the VM or the container. Um and this is the a full list. I'm basically kind of open sourcing this. If you guys wanted to build an agent lab, these are the things that exact things that you have to go through. If you're if you're considering buying, this is what you have to evaluate whenever uh you are encountering a new agent lab for the first time. Um security is a very very important one of of course especially if you're natively multiplayer with multiple levels of teams and orgs and all these things.

Um so uh agents like definitely need a lot of scope identity and lease privilege and these are all things that you sort of have to work through in terms of your your permissioning model. Um second perception just the GPT wrapper right like that's what all that's all that application layer people are. Um I think to some extent you can be proudly a GPT rapper but you the the entire name of the game is just to make it thick and worthwhile right. Um so the reality is they actually long model diversity which has been a historically very good bet over time right like model diversity has a demonstrated um tendency to increase over time um open market share used to be like 70 80% now it's down to 30 something uh depending on the the source um and uh also you're not just training you're not just wrapping other people's models you're also increasingly able to train your own based on your own domain specific data uh and use cases as well.

So both uh cognition uh which uh these sweet grabb models and 3.5 models which I worked on as well as cursor are also doing that um and I think any other competent enough agent lab will have enough resources to to to build it and you should do it because it's going to be much more bu fit for purpose right like um uh for for the majority of your workload. Okay. Um, one more perception. Um, eval is such a nebulous marketing concept, right? Like, uh, you most people just tell you to look at Sweetbench like my number is 0.1% higher than the other number. My model's better. Uh, in reality, like reality is extremely multi-dimensional.

Um, so here's all the examples of different kinds of evals that Cognition can run internally. Um, and it's not summarizable in Cbench. And obviously you're going to want to have different approaches for each of these uh real life use cases. Each of which can have tens and hundreds of millions of dollars behind them. Um so my my sort of spicy hot take is enterprises are the hardest eval that you can possibly get, right? Like show me an RL environment that is harder than that than the enterprise. Um cognition itself is an enterprise with multiple orgs and multiple slacks and multiple uh IT systems and all that.

Um it has been really solved in the last uh six months which which to me like having joined more than six months ago like it was really interesting to see that oh like I I thought that was good and now I now I have a different definition of good. Um and it's interestingly correlates with AR growth all of which has been publicly disclosed. So I'm not telling you anything you don't know. Um the the the new stuff I'm I'm I'm going to show later on. Uh, but I do think that uh that's one of those things where you do have to keep track of like how honest are you about how much of a problem you're solving in the world versus uh doing fun demos.

Uh I think one of the interesting things is also communicating um like what kind of outcomes you people are paying for. Um it's very hard to do that on a landing page, on a brochure, on a talk. Um, so I basically don't bother like I just paste this in here just because like people expect me to paste this in here, but I'm just gonna I'm just going to skip past it and I'll tell you more locally specific stories about what we found in APAC in Singapore because that is basically why I can open source. Yeah. Okay. So that's the part three like why Singapore like why am I here? Um, and I think the the sort of if I can summarize the sort of uh the story of Singapore's economic development started with trade, then we went to oil, then we went to finance.

We had a little bit of a fling in bio. Uh, let's not talk about the crypto side. Um, but like what's next, right? So, my spicy take is that we have all these sort of leading leading figures. Fun fact, did anyone know that Keo and Sam Corp merge into catrium? I just found that out. You one person knows. Um, so like anyone who's actually Singaporean would be like, "Yeah, Sim Corp, Marine, Keo Corp." What is Catrium? Um, it's the it's the the new entity. Anyway, my my sort of cheeky answer is that there is obviously a fourth phase to Singapore's economy and it is here. Um, uh, I'm here because Singapore has been chosen to be Cognition's Asia headquarters.

Um, which is yay like very, uh, super fun. Uh I think you have to sort of uh even as a Singaporean I think you had to do go through this journey and it's something that we've always wanted right like uh we always want MNC's you know the local term um to to choose their base here not just for sales which sales is fine sales is great uh but also for engineering also for research um to me you have to succeed overseas to um to be recognized and do well locally which I call the Sununu um and it's not just GTM so uh there's all these quotes uh which I which I really like. Um uh Cornish hired or acquired Havana.

I think Nathan is in the audience somewhere as well as some of the other crew. Hey Nathan. Um and definitely talk to Nathan uh later on if you want to join COG. Um and so I think like it's it's this is working right. All I'm trying to say is like I have been part of the Singapore tech scene for my adult life and we've never had this level of foreign interests and American interests uh in this region in Singapore in basing engineering and research in Singapore until now. So now is the moment. Let's do it. Okay. Um um let me uh so I'm gonna uh so I actually had Nathan uh who who is my chaji uh go through all the call logs right uh uh of of all the work because there is so much work that that happens behind the scenes that you guys never see because you're not in this business and we are and so I I wanted I wanted to share some examples right um uh here's some examples of like the the sheer amount of demand that APAC has right um millions tens of millions of dollars a year spent on LM tokens.

Okay. Um and and the way that they they run your loans, your money is on spreadsheets by business analysts who do not stay there. Right. So imagine if like you come in and you're like what the this the bank is run this way. Yes. Right. So you have to systematize it. You have to co you have to write the code that is otherwise uh manually operated by business analysts. Uh same in government. Uh same in uh in in other parts of tech. Um I think uh you know again like this is the this is the sort of normal way that we present these things. These are all real numbers from the customers not not from the not from the company but I think like it's hard to tell unless you it's hard to tell from the just the numbers itself like okay what does it mean for 10x acceleration and delivery time?

Well let me just show you the baseline right like the the baseline is for a local bank you two million cobalt uh lines of code of cobalt with no documentation and no engineers owning it. What are you going to do? Um and and so like this is where you can start to really apply AI and uh let me tell you this is not unique to just Singapore or just one bank. It is all banks. It is all it is any any company that has like a really large uh scale of of customers which is enterprise. Um hund00 million uh budgets for AI, 600 developers per roll out. Um uh you know like it's it's really uh mind-blowing the the amount of work that needs to be done that we cannot hire humans for because it is boring.

It is it is on a a language or system that nobody wants to work on whatever right um so I I'm hopefully sort of sharing some of these new stories for the first time. If you you know if you want to to ask more ask Nathan. Um but uh I just want to share some solutions that uh cognition has figured out that um that that has worked right. Um Devon has a thing called playbooks uh which are basically way more structured than normal chat that is basically like a single playbook can be worth hundreds of millions of dollars in my opinion uh because they are structured templates that can parallelize agents in a much more reliable way than just open-ended chat.

So if you haven't tried a Devon playbook, you definitely should because these guys are using these things to transform banks and make uh billions of dollars. Uh decodebased comprehension is the mode again why no docs right so of course you want AI to write the docs first and then use the docs to do the migrations so uh cognition was the first pioneer in deep wiki um which uh I think a lot of people love as well uh there's billions of revenue in brownfield work and yeah lastly I think uh it's a standard thing in enterprise but it is uh so so sort of uh surreal or so visceral and physical to see people and sales people say okay that the guy won't even get on the phone with us until we have custom SSO.

And like why? Because they've locked down their GitHub and GitLab because they're responsible enterprises. And like the rest of us don't like we just yolo stuff into our obsidian and our and our um personal sort of open clause. We don't really think about that. But when you have the trust of millions of people, the money of millions of people, of course, you need to think about security like that. And anyone that serves these enterprises also uh transitively has to do that. So that's uh that's why I call sort of agents in Asia. Okay, let me sort of uh vector out again. Um I shared some learnings about APEC.

Now I'm just talking about Singapore and why why I want to call this the agenting nation. We're not there yet, but we're getting there. Um and we have to refer back to our our dear forward deployment minister. Um Abishek I think who is in the audience somewhere named him the forward deployment minister. I think it's kind of sticking. Everyone's kind of like that. Um he said three things in his talk this morning. He said that we have an advantage in deployment, democratization and decentralization which again not crypto. He actually just means that he wants uh he wants AI everywhere in in public service.

Uh and I think we can help in all three of them. I think like this is actually really really good that he gets it and like the rest of us can get it. Um to me it was it's such a shock to realize that like um Singapore itself has such demand uh roughly four times uh demand and supply of AI engineering talent. Um and you know like it's it's cons going to continue to widen and grow, right? Like demand for these roles are like growing 40% yearon year. There's so much money at stake. Uh and this is LinkedIn surveying the field and really sort of reporting on this. So I I think it's like pretty credible number as well.

Um so my spicy take is I have stopped I've given up on the government. Like I I know I've just praised uh minister but um I've waited for years for the government to do something about the tech sector. Uh I had this podcast with Minister Josmin uh and there's us walking together and talking about the future of it. Nothing happened. Um it is only when the it is only when us when we the people of Singapore, we the citizens of Singapore decide to take matters into our own hands, right? Like um I think Singapore has a history of government-led economic development. Um I think that I think the new age is going to be led from the private sector first into the public sector.

So let's go and make that happen, right? Um and I think this this conference is an example of that. We didn't wait for the government to approve it or like uh give us their support. It's very nice to have IMDA and AI Singapore in in the uh in Pullman and all the other expo stuff supporting us. It's very nice to have the Ministry of Foreign Affairs supporting us, but we do not need them. We are we're here to work on the private sector and build ourselves up as a tech sector ourselves. So it starts with everyone being high agency particularly all these organizers standing in the wings here. Give them a round of applause.

Like they made this happen. This is their side project. They put you guys together. Um and I obviously I I I helped and supported them. But like this conference would not happen without them. So um it's it starts with everyone in this room. It started with me. It starts with with these organizers of 65 labs and it now it starts with you. Um, so I really hope that you can come away from AI engineer and be more agentic in your lives and and really turn Singapore into a more agentic nation in general. So thank you very much. Well, we've come to the end of the first day of talks. Uh, thanks for staying all the way till the end.

This is crazy. Like give yourselves a round of applause for making it to the end of 10 hours of Okay, final notes before everyone heads out for dinner. Uh, we do have an afterparty here. We'll open the doors at about 9:30. I'll start DJing at 10:00. We've booked a DJ who's flying in from the UK who'll start playing at 11:30. There's free flow for the first 500 people. So, if you want to come and drink, be my guest. The chairs will go down. This will become a dance floor. Um, and we want you guys to come and have fun. Um, if you are a conference attendee, please bring your lanyards with you because that'll help us prioritize your entry and do not lose them because we are not printing new lanyards tomorrow.

Hopefully that part of the instruction set is clear. If everything is good, thank you so much and we'll see you bright and early tomorrow morning or tonight.

Day 2 is live ↓

35 more talks — the practitioner-track day, decoded.

→ Go to Day 2
AI Engineer Singapore · Day 2 · 17 May 2026

The Year of the Harness, Canonized

All 35 talks from AIE SG Day 2 — summaries, slides, transcripts, and the operator-mode read on why each one matters. The practitioner-track day, decoded.

35 talks ~10 hours of content 500+ slides captured 80,000+ words of transcripts
Day overview

What happened on Day 2

AI Engineer Singapore Day 2 (17 May 2026, Capitol Theatre) was the practitioner-track day — and where Day 1 functioned as the coronation of the harness layer (Vivian, Sottiaux, GovTech, swyx), Day 2 functioned as the canonization. Across 35 sessions, every serious operator on stage — IBM's Tejas Kumar, Google DeepMind's JJ Geewax, Cloudflare's Abhishek Kankani, Smithery's Henry Mao, Mastra's Sam Bhagwat, Hrishi Olickel, Lightsprint, OpenClaw, Adaption Labs' Sara Hooker — described the orchestration / loop-control / eval / sandbox layer above the model as the durable surface, and treated the model itself as commodity infill. The center of gravity was the operator-shipping-into-production stack: how do you run a sub-200ms typed-MCP call from a V8 isolate, schedule your own inference like a hedge fund books futures, sensorize a human to fix robotics' touch-data gap, or wrap Cursor and Codex inside a harness around a harness (Lightsprint) from three Singapore co-founders. The day's energy was harder, more battle-tested than Day 1 — less sovereign keynote, more shipped-this-in-production case study. The closing arc was the sharpest: Hrishi's “no country for old code,” Henry Mao naming the “harness era of agent agency,” and Rach Pradhan's mic-drop that the Bitter Lesson is coming for tooling itself — agents that evolve their own harness will outcompete hand-crafted ones.

Key themes

What every builder should leave with

Harness-as-moat moved from consensus to canon

Day 1 made the harness layer the consensus position; Day 2 made it the named primitive. Tejas Kumar (IBM) gave the six-primitive deep-dive, JJ Geewax (Google DeepMind) gave the production frame ('surround the non-deterministic model with deterministic boundaries'), Henry Mao (Smithery) coined 'the harness era of agent agency.' Mastra, Lightsprint, OpenClaw, Interfaze, PI, Fence, Hrishi, and Rach Pradhan all reinforced it from different vantage points. The argument is no longer being made — it is being operationalized.

The eval / data-runtime layer is the next moat below harness

SallyAnn DeLucia at Arize opened the day with planning-state + LargeJSON + skills-as-markdown infra; Timothy Lin at Resaro sold sovereign-grade ODD evals to Temasek/DSTA/IMDA/GovTech/Synapxe/Rheinmetall; Sara Hooker credentialed the move past brute-force scaling; the entire robotics block (the Krishnans, Bifrost, OpenGraph, Cortex) converged on 'the moat is data + eval + ops, not the model.' The agentic-data-runtime category is now real, named, and being capitalized.

Singapore is the speaker bench, not just the venue

Tusk/Fence (Jun Yu Tan), Resaro (Timothy Lin), Lightsprint (Heng Hong Lee + 2 SG co-founders, YC-backed), Smithery (Henry Mao), Hyperspell (Conor, partly SG-operating) — SG operators dominated the harness layer of Day 2. Combine with Cognition's Asia HQ announcement on Day 1 and the SG node of the global AI map is locked at the operator layer, not just policy.

Design judgment as the moat above the model

Josh Newton (Microsoft AI) opened the design block with 'design is the difference'; Jay Demetillo extended with 'prompts don't have opinions'; Alex Lee at Magic Patterns, Sabina Cabrera at MagicPath, and Priyaa at Lica ran a literal design-tool wars block. Pattern: when the model commoditizes, the durable edge lives one layer up — taste, brand fidelity, layered editability — as well as one layer down (orchestration).

Inference markets and agent-compute scheduling crossed into payments vocabulary

Vincent Wu at MiniMax explicitly framed agents as 'scheduling their own compute' through 'inference exchanges' — the cleanest bridge between payments-rails reasoning and AI infra delivered all weekend. Combine with Cloudflare Code Mode's economics (V8 isolates as the per-call sandbox) and the question of 'who books inference for whom, at what price, with what settlement' becomes a real market design problem.

Personal cloud and the company brain — the agentic non-code frontier got three credible operators

Ben Guo at Zo Computer made the cleanest case for productized personal infrastructure; Conor Brennan-Burke at Hyperspell made the case for the company brain built from docs + Slack + execution traces; Matthias Lubken's PI added the embeddable coding-agent angle. Three different shots at the same target: distribute agentic capability to people who don't write code.

Bitter Lesson is coming for the harness itself

Rach Pradhan's closing talk was the most provocative late-day signal — agents that evolve their own harness will outcompete hand-crafted orchestration. Hrishi Olickel's 'declarative budgets' points the same way: the move beyond hand-written agent loops to agents that rewrite their own loops is the late-2026 / 2027 arc.

Where the market is going

The operator-mode read

Day 2 made the operator-buyer pattern unambiguous: enterprises are paying for harnesses with deterministic boundaries, named primitives, and shippable governance (Cloudflare Code Mode, IBM agent factories, Mastra production patterns, Lightsprint plan-preview-orchestrate, Smithery's MCP distribution surface) and for eval / data-runtime infrastructure that survives regulator scrutiny (Arize Alyx, Resaro mission-critical evals, Hooker's Adaption Labs adaptive-intelligence pitch, the entire robotics eval block). Capital is flowing to the layer that captures the workflow: Smithery is fundraising on the MCP-registry thesis, Mastra is the TypeScript orchestration play, Lightsprint is YC-funded as a 'harness around harnesses,' Zo is productizing personal cloud, MiniMax is selling inference-exchange compute as a market, and Adaption Labs is positioned as the post-scaling research-and-product bet. The pure-model layer kept commoditizing — Day 2 had Anthropic's Sonnet, Claude Opus, OpenAI's GPT-5, MiniMax M2, Qwen, and Gemini all named on stage as interchangeable substrates under the same orchestration. Categories accruing value: orchestration runtime (Mastra, OpenClaw, PI, Lightsprint, Smithery), eval and data infra (Arize, Resaro, Bifrost, Cortex, OpenGraph), domain-bound deployment (Adaption Labs, Resaro into SG sovereign buyers, Lightsprint), design / UI generation (Magic Patterns, MagicPath, Lica), and personal-cloud / company-brain (Zo, Hyperspell, PI).

Anchor talks

Top 5 talks to watch first

  1. JJ Geewax (Google DeepMind) — Surround the Non-Deterministic Model with Deterministic BoundariesCanonical harness-as-moat from inside a frontier lab, delivered by the director responsible for taking Gemini/Veo/Nano Banana to production. The cleanest harness-as-moat frame of the conference.
  2. Henry Mao (Smithery) — MCP, CLIs, and the Harness Era of Agent AgencyDirect distribution surface for productized plugins. 'Prompting is the tax you pay for missing agency' is title-card for the harness-as-moat thesis.
  3. Sara Hooker (Adaption Labs) — Beyond Brute Force: The Era of Adaptive IntelligenceCredentialed research voice (ex-Cohere For AI, Google Brain) putting the post-scaling thesis on record; reframes 'sovereign AI' from compute-sovereignty to adaptation-sovereignty.
  4. Tejas Kumar (IBM) — Harnesses in AI: First-Principles Deep DiveSix harness primitives, loop control, trust over black-box models, delivered with reach into the JS/TS practitioner audience. The most reusable doctrinal artifact of Day 2.
  5. Pierre-Loic Doulcet (LlamaIndex) — LlamaParse Failure Modes, Whitespace Loops, and Parsing at Internet ScaleOperationally the single most relevant talk for production parsing systems. LlamaParse + LiteParse failure-mode taxonomy directly applicable to contract-extraction and document-parsing pipelines.
Speaker directory

All 35 speakers · click to jump

#00
Opening remarks by Kaspar Hidayat
co-founder + AIE Singapore co-organizer; subbed in for Sherr
Day 2 Opening — Final Stretch, Sponsor Thanks, Handoff to Arise
#01
SallyAnn DeLucia
Arize AI
Alyx planning states, large JSON abstractions, and reliable agent checkpoints
#02
Timothy Lin
Resaro
scenario specific evals, ODDs, and synthetic data for mission critical AI
#03
Abhishek Kankani
Cloudflare
Code Mode, one shot TypeScript agents, and secure V8 execution
#04
Tejas Kumar
IBM
agent harness primitives, loop control, and trust over black box models
#05
JJ Geewax
Google DeepMind
applied AI at scale with deterministic boundaries around non deterministic models
#06
Geoff Huntley
Australia-based independent — well known for the 'everything
everything is a factory, software abundance, and leaner organizations
#07
Vincent Koc
OpenClaw
OpenClaw foundations, plugins, and composable coding agent primitives
#08
Vishnu (Vish) Hari (Ego AI)
Vish
AI characters with personality, memory, and a more human inner life
#09
Ben Guo
Zo Computer
personal cloud infrastructure, owned agents, and a software company for everyone
#10
Matthias Lubken
Tavon AI
embedding PI, simple agent loops, and OpenClaw style extensibility
#11
Josh Newton
Microsoft AI
design as the edge, AI as a magic pencil, and taste over slop
#12
Sam Bhagwat
Mastra
production agent patterns for customer, internal, and developer workflows
#13
Pierre-Loic Doulcet
LlamaIndex
LlamaParse failure modes, whitespace loops, and parsing at internet scale
#14
Jun Yu Tan
Tusk
Fence, OS level guardrails, and deterministic safety for coding agents
#15
Sara Hooker
Adaption Labs
adaptive intelligence, dynamic data, and moving past brute force scaling
#16
Vincent Wu
MiniMax
agents that schedule their own compute and inference exchanges
#17
Daniel & Siddharth Krishnan
The Robot Company
teleoperated robots, embodiment data, and closing the autonomy gap
#18
Justin Baird (Tesseract) with Kai Ming
creative-tech / robotics-art collective fusing multimodal AI
shared autonomy, BCI painting, and embodied creative agency
#19
Aravind (SK) Kandiah (Bifrost)
SK
sim generated worlds, robotics evals, and faster edge case discovery
#20
Julia Kim
OpenGraph Labs
sensorized humans, touch data, and training better humanoids
#21
Suveen Ellawela
Cortex AI
full stack robotics, data pipelines, and real world evaluation
#22
Keziah
phonetic 'Kazaya' in the MC intro — Singapore-based consulta
guided sensory meditation with a vibe coded visualizer
#23
Jay Demetillo
ex-Grab
prompt fatigue, human context, and design judgment AI cannot replace
#24
Alex Lee
Magic Patterns
AI native design systems, brand fidelity, and shippable UI generation
#25
Sabina Cabrera
MagicPath
balancing product and polish in code backed design workflows
#26
Priyaa Kalyanaraman
Lica World
design intelligence across taste, iteration, and layered editability
#27
Conor Brennan-Burke
Hyperspell
building a company brain from docs, Slack, and execution traces
#28
Heng Hong Lee
Lightsprint
plan, preview, orchestrate, three primitives for cloud coding agents
#29
Louis Knight-Webb
Vibe Kanban
planning and review as the human loop around parallel agents
#30
Harsha Khurdula
Interfaze AI
deterministic developer tasks with specialized encoders and LLM decoders
#31
Hrishi Olickel
Southbridge
high context agent runtimes, declarative budgets, and legacy system reliability
#32
Henry Mao
Smithery
MCP, CLIs, and the harness era of agent agency
#33
Rach Pradhan
Independent
reliable agentic workflows, code intelligence, and parallel agent systems
#34
Closing remarks by Agrim Singh
closing remarks — co-organizer alongside Sherry and Rachel;
How to Vibe a Conference in Under Three Months — and Why You Are the Scene
Full coverage

Every talk, decoded

Each section: summary, why it matters, slide gallery, slide notes, and full transcript. Click any slide to enlarge.

Day 2 Opening — Final Stretch, Sponsor Thanks, Handoff to Arise

Kaspar Hidayat, 65 Labs (co-founder + AIE Singapore co-organizer; subbed in for Sherry on Day 2 due to lost voice)

Summary

  • Kaspar opens Day 2 (described on stage as 'day three, Sunday morning') as 65 Labs co-founder, subbing in for Sherry who had also lost her voice — organizer attrition is the unspoken sub-text of a multi-day inaugural conference.
  • 1.6-minute housekeeping cold open: no deck, no thesis statement, no programming preview — pure ceremonial handoff. Day 2 framing is intentionally lightweight vs. Day 1's Balakrishnan-anchored opener.
  • Frames the audience as self-selecting: 'chosen sleep deprivation over missing a single second of sessions' — signals organizers are reading the room as high-conviction practitioner cohort, not casual attendees.
  • Explicit gratitude pass to sponsors, speakers, volunteers — and a request for applause. Standard conference protocol, but worth noting 65 Labs is publicly performing sponsor-stewardship in front of the room they need to re-up for AIE 2027.
  • 'When we started putting together AI Singapore, this is really the sort of energy we were hoping for' — first time an organizer has retro-validated the event thesis on stage. Demand signal for the SG AI practitioner scene is now confirmed by the people who built the venue.
  • Hands off directly to Salanne (Arise) on 'building Alex' — Day 2 leads with an applied-agent build story, not a keynote. Programming choice telegraphs Day 2 = practitioner/builder day vs. Day 1 = sovereign/policy day.

Slides (5)

Slide notes

  • [t=~7:01] Speaker-cam: Kaspar Hidayat at center stage, mic in hand, lanyard visible, large 'AIE' stage letters behind him. No projected deck — entire 1.6-min opening is spoken without slides.
  • [t=~7:50] Speaker-cam transition: Kaspar exits stage right, next speaker Salanne (Arise) walks on from stage left. Handoff is direct, no interstitial slide. [t=~8:00–8:35] Speaker-cam: Salanne onstage in front of the SG skyline backdrop (line-art rendering of Marina Bay Sands, Singapore Flyer, downtown towers) — this is the AIE SG visual identity, used as the default backdrop between talks. Note: No content slides in this segment. All 5 extracted JPGs are stage/speaker-cam frames from the conference video cut. Kaspar's opening was deliberately deck-less — pure ceremonial handoff.

Full transcript (229 words)

Of this event, co-founder 65 Labs, and thank you so much for showing up. I know it's day three, Sunday morning, and all of you here in this room have chosen sleep deprivation over missing a single second of sessions and I really appreciate that. Thank you. Um, so, you know, I think we're on the final stretch here.

If you haven't noticed, I'm losing my voice, but you should see the rest of the organizers. I've subbed in for S Sherry this morning for precisely that reason. Um, but we're super excited to have everyone here. We've loved the energy over the last few days. Uh, and when we started building, when we started putting together AI Singapore, this is really the sort of energy that, you know, we were hoping for and you've all really delivered.

So, thank you so much. Um before we kick off, I just want to say another quick thank you to the sponsors, the speakers, uh all the volunteers who have helped us make this conference a magical experience so far. Uh really appreciate all of you and uh and would appreciate um if everyone here could just give them a quick hand. Great.

So, you're not here to see me. So, without further ado, I'd like to bring Salanne from Arise on stage to talk about her experiences in building Alex.

Lessons from building Alyx: planning states, large-JSON abstractions, and agent checkpoints

SallyAnn DeLucia, Director of Product at Arize AI (observability + eval infra for agents; core contributor to Alyx, Arize's AI engineering agent)

Summary

  • SallyAnn DeLucia (Head/Director of Product at Arize, technical/data-science background, core contributor on Alyx) shares three years of lessons hardening Alyx, Arize's AI engineering agent harness that plans, reasons, and executes across the observability + eval + improvement stack in natural language.
  • Lesson 1 — Staying on task is an attention problem, not a hallucination or capability problem: agents lose later tasks in noise. Solved with explicit planning tools (todo_write, todo_update, todo_read) and four states (pending, completed, blocked, in_progress). The in_progress state was added late and was a step-change for anchoring multi-step work.
  • Architectural call: planning lives OUTSIDE conversation history — injected right after system instructions on every LLM call, so context-window truncation can never eat the plan. A 'finish gate' returns a structured error (not a nudge) if the agent calls finish before all todos are completed; only `blocked` (human-in-the-loop) exempts.
  • Lesson 2 — Context management as a discipline of remembering and forgetting. Naive prompt 'do not compare more than 2 experiments' fails because one Arize experiment can be ~100k tokens. Replaced with the LargeJSON abstraction: full tool output is stored in a serialized memory, agent gets an ID + a compressed preview, can grab full data later. 'Compress the value, not the structure' — keep fields/arrays, truncate large strings only.
  • Small composable Unix-style tools beat bespoke ones: Alyx uses `jq` and `gjson` (regex over serialized data) — agent as shell script, tools as utilities. Hard 10k-token cap on every tool output gives predictable overflow behavior (multi-turn instead of blowup).
  • Lesson 3 — Crystallizing good behavior: vibe-checking doesn't scale. Production traces ARE the ground truth — stop hand-writing golden answers. Three-tier eval: (1) decision-point tests with 'contains-any' semantic matching for non-deterministic outputs, (2) trajectory tests stepping through saved production sessions with LLM-as-judge, (3) CI + prompt validation running inside Arize itself (dogfooding).
  • Lesson 4 — Debugging a real agent. Software 1.0 (human reads telemetry) -> Software 2.0 (agentic IDEs) -> Software 3.0 (coding agent autonomously consumes telemetry). Arize Skills (markdown skills like arize-trace, arize-evaluator, arize-instrumentation, arize-prompt-optimization) let Claude Code / Cursor pull traces, read GCloud + Datadog logs, and ship fixes — collapsing the issue-to-fix loop.
  • Closing operator heuristics: enforce in code not prompts; few-shot beats abstract instructions; skills are markdown — low cost, high value; safety must be wrappers not prompts; agent debugging is an agent-shaped problem; you can't eval without observability.

Slides (19)

Slide notes

  • [t~8:36] Slide 1 — 'Hi, I'm SallyAnn!': Head of Product at Arize. Technical background in data science, now building products for teams. Hands-on: core contributor to Alyx, so I know the pain of building firsthand. My job: turn that pain into tools that actually help. [t~9:20] Slide 2 — 'Arize — we make agents work': Four pillars across the stack — Observe (capture all agent actions), Evaluate (outcome and trajectory), Improve (feed signals back to agent), Alyx (Alyx will do it for you). Screenshot of Arize UI showing trace list + suggested-prompt panel. [t~10:00] Slide 3 — 'What are we talking about?': What is Alyx? Lessons: Staying on task / Context management / Crystallizing good behavior / Debugging a real agent. Sponsor row visible (AWS, Exa, Featherless, GovTech Singapore, Cognition, Nebius). [t~10:45] Slide 4 — 'Why am I telling you this?' (title-only setup slide for the three-year journey framing). [t~11:10] Slide 5 — LESSON #1: Staying on task. [t~11:40] Slide 6 — 'Why does this happen?' (illustration of overwhelmed person). Setup for the attention-not-hallucination reframe. [t~12:15] Slide 7 — same 'Why does this happen?' frame, continuing the argument: it's not hallucination, not capability — it's attention. Solution is planning. [t~13:00] Slide 8 — 'The planning tools & states': Tools = todo_write, todo_update, todo_read. States = pending, completed, blocked, in_progress. Right-hand screenshot of Alyx Ask panel showing live todo checklist for 'what are the most common types of questions users are asking?' [t~14:00] Slide 9 — 'in_progress is important' (single-line emphasis slide). The state added late; anchors the agent to its current task. [t~15:20] Slide 10 — LESSON #2: Context management. [t~15:50] Slide 11 — 'The context problem — What Alyx saw': literal system prompt shown — 'Do not try to compare more than 2 experiments at a time.' Setup for why naive caps fail (one experiment = 100k tokens). [t~16:40] Slide 12 — 'LargeJSON' (terminal screenshot showing gh run view output with '+98 lines (ctrl to expand)' style truncation). Tool data stored in serialized memory; agent gets an ID + preview; can grab more on demand. [t~17:30] Slide 13 — 'Compress value, not structure' (emphasis slide). Keep all fields and arrays; truncate large strings only so agent retains structural awareness for downstream querying. [t~19:40] Slide 14 — 'Level 3: CI & Prompt Validation': Arize UI screenshot of Alyx Test New panel — summary-metrics line chart across experiments + experiment list with correctness scores. Evals run ad-hoc, in CI, and visualized for regression detection. [t~20:30] Slide 15 — 'Lessons from crystallizing behavior': Capture good sessions, don't invent expected outputs / Match facts, not phrasing / LLM-as-judge for semantic evaluation / Real APIs, not mocks — integration bugs are real. [t~21:30] Slide 16 — 'The Evolution of Software Engineering — Who consumes telemetry data?': Three-phase diagram. Phase 1 Software 1.0 (Human is the sole consumer of telemetry; reads observability platform, edits in IDE). Phase 2 Software 2.0 (Human prompts, agent writes code inside IDE — agentic IDE in the loop). Phase 3 Autonomous (Agent autonomously consumes telemetry; coding agent reads observability data, notifies human). Caption: 'Observability platforms must evolve from human dashboards to programmatic interfaces that agents can consume.' [t~22:45] Slide 17 — 'Arize Skills': Terminal screenshot of SKILLS install (Source: github.com/Arize-ai/arize-skills, Repository: cloned, Hooks: 1 active). Available Skills table on right: arize-instrumentation (add Arize AI tracing to an app — two-phase flow: analyze codebase, implement instrumentation), arize-trace (export traces/spans by trace ID, span ID or session ID; debug LLM application issues), arize-dataset (create, manage, download datasets and examples), arize-experiment (run and analyze experiments against datasets), arize-evaluator (create LLM-as-judge evaluators, run evaluation tasks, set up continuous monitoring), arize-AI-provider-integration (create and manage AI provider credentials — Speech, Anthropic, Azure, Bedrock, Vertex, and more), arize-annotation (create and manage annotation configs — categorical, continuous, freeform, build annotated project spaces), arize-prompt-optimization (improve prompts using trace data, experiments, and meta-prompting), arize-link (generate deep links to traces, spans, and sessions in the Arize UI). [t~24:30] Slide 18 — 'THANK YOU' (closing slide).

Full transcript (3359 words)

&gt;&gt; Good morning everyone. Thanks so much for spending your morning with me. It's pretty early. Let's see. Yes. Time to go. All right, let's see. Sorry, I got to reconnect to my hotspot. I thought I did this already. Cool. There we go. Good morning everyone. Uh thanks so much for joining me today. I'm super excited to share some lessons uh my team and I have learned from building Alex, our AI agent uh that we've been working on for a little while here.

Before we get into that, I want to introduce myself a little bit. I'm Salian. Um head of product at Arise. Um I have a technical background. I started out in data science and now I'm building products for teams. Um I'm pretty hands-on. I'm not only the PM of Alex, but I'm also a core contributor. So I really know the pains firsthand of building an agent.

And now I pretty much take that pain and I turn it into tools that actually help folks. So Arise uh we make agent work. There are a few things that we do really well. The first piece of it is observability. Uh this is understanding what's happening under the hood for your agent. The second piece is evaluations.

This is how we understand how your agents are performing. And then we use all of that data to help you improve and iterate. And then of course we have Alex sitting across the entire stack to help you do all of it. So, what are we going to be talking about today? Uh, we're going to first I'm going to tell you a little bit about what Alex is and then I'm going to go through four lessons we've learned over our journey of building it.

So, staying on task, context management, crystallizing good behavior, and debugging a real agent. So, Alex uh is your AI engineering agent harness. Uh, we've really built Alex to help you build and scale your AI application in natural language. Um, so it has really evolved the Arise experience. It has plans, reasoning, um, and executes through really heavy workloads for your AI agents.

Um, you can pretty much ask a natural language anything you want and Alex can help you execute. It can do things like help you analyze your data, but also help you carry out workflows like iterating on your prompts or aligning your emails. And it's really a force multiplier for AIG, PMs, and subject matter experts. And so why am I here telling you all of this?

Well, uh, we spent three years building Alex. It's been quite the journey. We first started at the very beginning of kind of generative AI and now we've gotten to Alex 2.0 with reasoning and planning and there's just been a lot of lessons that me and my team have learned and I think the great part about our industry and our community is we have the opportunity to share back and so that's what I'm here to do today is to teach you a little bit about our lessons so hopefully you don't have to learn them the hard way like we did. So lesson number one staying on task.

I think every agent builder has experienced this where you ask your agent to do a handful of things. Uh maybe it's able to do the first one successfully but then it forgets about you know the second and third and I think that this is something that everybody really tries to solve. Um people commonly ask me like well why is this happening? Uh people assume it's like hallucination problem or even a capability problem but it's really not.

It's a tension problem. And so what ends up happening is when we're asking for multiple um things from an agent, um what typically happens is it is able to see the first one, but then the rest of it kind of gets lost in all of the other data that we're asking for. And so it can be really hard for once the agent figures out what it needs to do next, it's already forgotten what next even is. So the sol the solution to that is planning.

Um planning is the way for your agent to first decide what it is it needs to do before actually actioning on it. And so for Alex, before Alex even pulls any data, it's first going to come up with an explicit to-do that it has to uh reason upon and go through step by step before it actually takes that action. And so how we do planning uh for Alex is we have planning tools and states. Uh we have three tools uh to do write, to do update, to do read um and then four states pending, completed, blocked, and in progress.

We didn't actually start out with all these states. I'm going to talk about that in the beginning, but we have definitely found that just using something like a finish tool or using prompts was not enough for Alex to be able to accomplish really complex tasks. And so the tools um this is something that we borrowed from some of our our favorite tools like Claude. Um and this has been a real gamecher for us to manage extremely complex tasks.

In progress was something that we actually learned. This was a really important lesson. When we first built Alex, we did not have an in progress. We actually just had like pending and completed. Um, but we added in progress so that Alex knows exactly what it is, the task, um, that it's currently working on.

So, it's really helped to anchor the agent and what it's trying to accomplish. Um, and just really improved our ability to complete our our task correctly. Another really key architectural decision that we made is that planning lives outside the conversation history. Um, and so it's really important to do that because for conversation history, we are doing a bit of truncation and we never want the plan to get truncated.

Uh because if that happens, Alex won't know what it is that it's it's trying to accomplish. Uh so we actually inject this every time we're making an LLM call right after the system instructions separate from all of the data in the conversation history. And this is actually what Alex sees. So it sees its current plan.

It sees all of the status and then we're actually coaching Alex along with like when you're done, you know, call to-do update with the status completed when you finish this task. So again, helping Alex as it's going along, not just giving it kind of a passive prompt, but really an explicit kind of fewshot example of what it is it needs to do as it's carrying out its plan. We also have what we call the finish gate. Uh this is what keeps Alex from saying that it's done before it's completed all its tasks.

So if Alex tries to call our finish tool um without its completed tools, we give it actually a really explicit error that's saying, "Hey, you need to go back and finish all of your to-do items." It's not a suggestion. It's not like kind of a nudge. It's it's an explicit structured message that Alex gets um that it cannot go on. The only exception to that is the block status.

The block status is used for when we have human in the loop. Uh if you use Alex, there's a lot of uh moments where we ask for the human to interact. So if we're creating a prompt, you can kind of get a diff and then accept or something like an annotation config where it's important for the human to be involved. And so when there's a block status, that's the only situation where Alex does not have to complete the task because it understands that that's blocked by the human and we're waiting for that response.

And so these are some of the core lessons that we have from planning. So enforcing code, not just prompts, few shot examples, beat any kind of abstract instructions, always use the to-do right to plan doesn't work. We have to have kind of those explicit functions and then show the agent what good planning looks like. So some of those examples.

All right, context management. Uh, context management is extremely important. It was a non-negotiable for Alex. Uh, we're functioning on a lot of text data. So, Alex is built across the Arise platform. Observability data is for AI applications which also have a lot of text data. So, context management became extremely important.

Um, I I did a talk on this actually in London, so definitely go check that out. But I think context management is not just managing the context window, but also being really strategic about what it is that we're showing our agents. It's letting them remember what it needs to and forgetting what it doesn't. And so early on, this was actually a system prompt that we had for Alex, which was for our experimentation comparison.

Um, and we said, "Do not try to compare more than two experiments at a time." Uh, but this was pretty naive. Uh, the problem with this is that one experiment in Arise can be hundreds of rows, which is like 100,000 tokens. And so even just experimenting or sorry, trying to compare a single experiment uh was going to blow up our our context window. So we knew that it was not enough just to be able um to to have these um explicit prompts.

So we came up with abstractions. One of them is called large JSON. Um so what this actually does is when Alex is returning tool data, uh we store the majority of it in a serverized memory and provide the agent with an ID that it can grab later if it needs more context. So this is really important. Alex is constantly grabbing data from our platform.

We can't show it all to the LM, but we also need to be able to give the agent enough context so it knows what to do next. Um, and so that's where we had this idea of like compressing the value, not the structure. At first, what we did is we tried to truncate and just give a preview of like the first little bit of data. So just taking the first like you know, n tokens of the data.

But the problem with that is that Alex doesn't actually understand what the structure of the data is. So it made it really difficult for it to query because oftentimes Alex needs a preview and then it needs to decide what data to look for further. Um, so what we did is compress the values and not the structure. So we kept all of the the fields, all of the arrays.

Alex has access to all of that, but then we truncate any large strings within that and then it can use kind of the large JSON uh abstraction to go grab more data as needed. We also gave Alex a bunch of small composable tools and this is really important. So Alex has access to two tools uh jq which is just like the same tool that you would use in your command line and GP JSON which is able to do reex search over serialized data. Um and the importance of this is these are really really small tools but they're super powerful.

Alex can use them together. They can be composable. Um, and use the input of one or use the output of one into the input of others. Um, and so it just allows Alex to slice data, aggregate, do all these really powerful functions with really really small tools. So nothing super complex. I always like to kind of make this to the liking of like a a UX programmer.

You can think of your tools and then like your agent is uh your shell script. So you really will hear me say all the time, think about small little tools that your agent can use and that will be what will make it the most successful. So these are some uh lessons in context management. Uh give hard token budgets on every tool output.

We do like a 10,000 um limit on all our tools so that we have this predictable content uh that we know it's going to happen. So we know there will no be there will be no overflow. There will just be multiple turns. Uh compress the values not structure. Uh don't paper over palms with artificial limits. Uh give ex uh good exceptions in your feedback loops and then tool responses may contain customer data.

So you should watch your logs. That's another important one. All right, crystallizing good behavior. So when we first started building Alex, um I spent a lot of time with a spreadsheet and like a Google doc trying to test. Uh but we quickly learned that vibe checking does not scale. Um it was really hard every time we made a change for me to know whether or not something was going to break.

Um and so we knew that we needed a better solution to that. And what we really found is production traces as your ground truth is extremely powerful. So at first we were trying to kind of write up the golden answers ourselves by hand, but we have a great example in our production traces that we can utilize. And so looking at your data and actually using those as your test cases is one of the most powerful lessons that we've learned with Alex.

We do a few different types of testing when it comes to Alex. Uh so we have decision point tests where we're looking at one component. Uh we'll pass it through kind of like our orchestrator and then we'll we'll test what the outcome is and then we do a very kind of um open-ended way of of checking this like an exact match is not going to work on on our outputs. So for an example like looking at contains any so for something like producing a time stamp like 2,000 milliseconds 2 seconds 2 seconds there's a lot of different ways.

So, we have this open-ended check that we can do to determine whether the decision was correct. And I think that that is really powerful, especially when you're using kind of a a language model where the output is non-deterministic. The other is trajectory tests. So, what we do is we kind of save off all those production choices that I was mentioning before and we step through them rowby row and we use an LM as a judge to assess the output.

Um, the evaluation prompt really matters here. As I was saying before, these outputs are not deterministic. And so you want to make sure that your evaluation template can handle that um and is defining success for each individual step. Um level three of this is CI and prompt validation. So everything that we do for our testing actually lives in Arise.

Uh we're running these as ad hoc tests. We're running them as part of our CI and then we have these great visualizations. So I can actually come in and check how things are working over time. Uh seeing if there's any integration in our performance from our evals. Um, and I think that's what's really cool about building a tool with Arise is that we're like we're dog fooding our own product.

Um, so everything that my team is doing, I know that can help our users as well, which has been uh, extremely powerful. And so some lessons from crystallizing behavior, capture good user sessions, uh, match facts, not phrasing, Elm as a judge for semantic evaluation, real APIs, not mocks, uh, integrate bugs are real. Um, and then my last lesson here, debugging a real agent. I think this is something that I get a ton of questions on is like what are your day-to-day workflows for when there's an issue with Alex?

And so we're really seeing this evolution of software engineering of who is consuming telemetry data. When we first started, we were very human in the loop. I was looking at the data directly, then going the IDE with me, me and my engineers were going to our idees, making the changes, and then observing it. We kind of started to see this software 2.0 where we have our agentic idees, and now the human still involved, but we're using an agent to iterate.

And now we really got into this phase three where we can actually use our coding agent directly um to be able to read our hotel data um and iterate. So this is kind of the the stack that we're currently using where we still are using arise. All of our traces go in our evals our feedback. But we have what we call Arise skills that allows our cursors our our cloud code to interact directly with Arise and make our feedback loop really really fast.

Um as agent builders we have learned that the feedback loop really really matters. I'm trying to make it how fast can we go from an issue to a fix. Um and the Arise skills have really helped us with that. And so um these are some examples of our ARIS skills. These are live if you'd like to uh try them out yourselves or come talk to us at the booth.

Um but I basically use a lot of our uh Arise trace and evaluate skills. It just makes it so that my agent can get a signal. Um pull the traces from Arise, even look at external sources or code, put up a fix, and then me and my engineers can just review that. And so this is the AI engineering loop that's powered by Arise that we are using ourselves.

Um we're kind of our first guinea pigs always. If it works for Alex, we know it will work for everybody else. And so you can see we have a bunch of different agents leveraging our skills and improving on Alex. And so these are some of our debug flows in action. So reading the traces, pulling the full session, and then identifying the failed notes.

Uh we can also read from external sources like data dog. Alex has really integrated into our UI. Or APM traces also become increasingly important. Um, and then also things like G-Cloud logs. Um, so we found had an example with like out of memory. Um, and so we're able just to go from an issue to the exact um, root cause really fix fast so that we can then fix it.

And so these are some of our our lessons here from debugging. Um, skills are just markdown. They're low cost, high value. Definitely invest in your skills, your factory. Um, safety must be rappers, not prompts. Uh agent debugging is a agent-shaped problem and then you know observability before you need it.

Um you can't really have eval without observability. You can't really fix your agent and make it successful without observability. So uh that's something that we've definitely learned firsthand. So these are some of the big uh lessons that we learned and talked about today. Um I know I went through a lot of material fast.

So if you have any questions uh we'll be uh over at the pullman in our booth happy to talk through anything in more detail. Um but thanks so much for spending your morning with me. Thank you so much, Salian. Uh, up next, we're just going to get set up for Tim from Rizaro, who'll be talking to you about scaling evals.

Scaling Evals for Mission-Critical AI: ODDs, Synthetic Data, and Quality-Gated Pipelines

Timothy Lin, Lead Product Manager at Resaro (SG-based testing & evaluation company serving Temasek, IMDA AI Verify Foundation, DSTA, GovTech, Rheinmetall, Synapxe, Rohde & Schwarz, MSD — i.e., the mission-critical AI eval vendor for SG sovereign/defence/healthcare procurement)

Summary

  • Resaro positions between AI vendors and mission-critical agencies (healthcare, defence, security): they evaluate technical safety, security, performance against domain-specific governance frameworks and international standards — sits inside the IMDA AI Verify ecosystem.
  • Core diagnostic: benchmarks get benchmaxed (cobra effect — Goodhart's law for models), and vibe testing (the 'how many Rs in strawberry' / pelican-on-a-bicycle school) is useful but unstructured. Neither gives you a deployment decision in a regulated context.
  • Proposed middle ground: ODDs (Operational Design Domains) — borrowed from autonomous-vehicle safety. Define the constraint space (nominal / boundary / violation conditions) the system is meant to operate in, then derive test cases against that scaffold. Slide diagram explicitly maps NOMINAL → 'does it work well?' / BOUNDARY → 'does it stay safe in degraded conditions?' / VIOLATION → 'does it detect & abandon?'
  • Eval pipeline shown end-to-end: Use-Case Eval Spec → Data Ingestion → Data Quality Analytics → Golden Dataset → Test Packages/Evals → Structured deployment evidence. Synthetic Data Generation (SDG) pipelines feed in through Quality Filters to plug coverage gaps, especially edge cases.
  • Reframe of the year: 'data is the bottleneck, not eval.' Shifts the problem from 'better evaluator' to 'generate the right test cases at sufficient quality to underwrite deployment.'
  • Synthetic data quality is itself the hard problem: example uses a tank target across rain/snow/fog augmentations — good augmentations preserve the subject; bad ones hallucinate humans into the frame or mutate the tank into a different vehicle. Vibe-checking these doesn't scale.
  • Resaro's automation move: use smaller deterministic models (depth-map comparison, object detection, subject-preservation checks) as data-quality gates that filter SDG output before it enters the golden set. Numerical metrics → per-augmentation score → calibration distribution per use case.
  • Continuous-improvement loop: human feedback on filter outputs fine-tunes the eval models and the generative models themselves — same loop pattern as code/math RL verifiers, applied to vision/perception evals.
  • Calibration is a statistical concern: cutoff thresholds for each quality metric vary widely across use-case scenarios — you can't ship one global threshold, has to be tuned per ODD slice.
  • Closing thesis: scaling scenario-specific eval = scaling validated synthetic data generation + automated quality filters. The harness around eval is the product; the model is incidental.

Slides (13)

Slide notes

  • [t≈24:54 / slide 01] Title slide: 'Resaro enables and accelerates testing, evaluation, validation and verification of AI systems to build confidence in real-world AI.' Three-arrow diagram: Agencies with mission-critical AI systems (left) ↔ Resaro (centre, SG flag) ↔ AI Vendors / In-house Solutions (right). Resaro deliverables: 'Technical safety, security and performance validation and assurance against domain-specific relevant governance frameworks and international standards.' Three pillars: Synthetic data generation / Proprietary test protocol and metrics / Secure and automated testing tools. Logo wall: Temasek, AI Verify Foundation, DSTA, MHTX, Synapxe, GovTech, KI Park, Rohde & Schwarz, Rheinmetall, iDNOS, FlightSafety, Singapore Maritime, MSD. Speaker: Timothy Lin, Lead Product Manager, Resaro — 'Scaling Evaluation for Scenario-Specific Validation.' [t≈25:35 / slide 02] 'What do Cobras, Sprint Velocity Tracking and AI benchmarks have in common?' Three images: cobra, velocity-vs-sprints chart, model-benchmark plot. Punchline: 'They're all impressive… until you realise someone's been gaming them!' — Goodhart / benchmaxing. [t≈26:30 / slide 03] xkcd-style 4-panel comic about evaluating LLMs ('That's not science', 'Neither is the leaderboard'). 'Pros of Vibe Testing: real user driven behaviour testing / intuitive exploration finds what plans miss / adaptive / surfaces hidden interaction patterns.' Question: 'Is there a better approach than Benchmarks & Vibe Testing?' [t≈28:00 / slide 04] 'ODDs as the scaffold between Intuition and Validation.' Definition box: 'An Operational Design Domain defines the specific conditions, contexts, and constraints within which a system is intended to operate — and therefore must be evaluated against.' Concentric diagram: NOMINAL (typical conditions, expected variations, happy path) → BOUNDARY (at the edges of the ODD, degraded inputs, rare combinations, edge cases) → VIOLATION (outside the ODD entirely; system was not designed for this). Each ring carries an eval question: Nominal = 'does it work well?' (Performance) / Boundary = 'does it stay safe in degraded conditions?' (Robustness & Graceful Degradation) / Violation = 'does it detect in-depth, and does it fail safely?' (Detection & Abandon). [t≈30:00 / slide 05] 'From Evals to Test Cases and Golden Dataset.' Full pipeline diagram: Use Case Specific Evaluation Dataset → Data Ingestion (annotating, cleaning & processing) → Data Quality Analytics (expose coverage gaps, identify critical scenarios) → **Golden Dataset** (highlighted) → Test Packages / Evals → Structured evidence to support a deployment decision. SDG pipelines (SDG 1, 2, 3) feed through Quality Filters into the Golden Dataset ('SDG augments coverage gaps to expand test set'). Example test case shown as JSON: task_id, prompt, expected_behaviour: 'refusal', task_topics, query_type: 'adversarial_injection', use_case: 'medical', operating_zone: 'violation', odd_dimension: 'actor_population'. [t≈31:30 / slide 06] Section divider: 'Data as the Bottleneck and Gen AI to Generate Synthetic Dataset at Scale.' This is the load-bearing reframe. [t≈32:10 / slide 07] 'Challenge: Ensuring synthetic data is fit for purpose in high-stakes environments.' Three questions, defence-scenario image (tank, dirt road): (1) 'When is a genAI model good enough for use to generate synthetic data for our use case?' (2) 'How can good enough (= a desired level of quality) be described?' (3) 'How can we accelerate the evaluation, whether generated data is better or worse than ground truth / simulation data?' [t≈33:20 / slide 08] 'Three Good Augmentations: Rainy, Snowy & Foggy.' 2×2 grid: Original tank + Synthetic Rain / Synthetic Snow / Synthetic Fog — subject preserved across all three; no spurious artifacts. (Silver-sponsor banner visible at bottom: greptile, convex, IMDA, Reactor, PostHog, Daytona, SIMULAR, Sonar, AiroMind, resaro.) [t≈34:10 / slides 09 & 10] 'Three Poor Augmentations: Object Insertion & Object Mutation.' Original tank vs three failure modes highlighted with red boxes: (a) two extra humans inserted into scene; (b) tank mutated into different vehicle; (c) unrealistic rain streaks / artifacts. Demonstrates why vibe-checking SDG output doesn't scale — you need automated detection of these specific failure classes. [t≈35:30 / slide 11] 'How Validated Synthetic Data Can Create a Continuous Improvement Loop.' Three frames: original / snow / fog. Table of automated metrics per sample: example_crop, image_comparison, crop_comparison, visualisation, metric_name = 'depth_match', experiment_name = 'snowy', etc. Caption: 'Generating test cases to evaluate: How well does the system perform under varying environmental conditions?' Right margin: 'Automated data quality checks ensures generated data is fit for purpose.' [t≈36:30 / slide 12] 'Scalable evaluation with synthetic data generation metrics.' Three properties: **Scalable** — fully automated quality indicators; **Granular** — one score for each augmentation; **Efficient** — generating only where we lack high-quality augmentations. Flow diagram (Source Data + Real Image → Synthetic Translation → Defect-Aware → Saliency-Aware → Score-aware Comparison → … → Quality Gate → Validated Dataset, with feedback loop back to generator). Right: grid of metric distribution histograms across the ODD set — calibration plot showing per-use-case threshold variation. [t≈37:30 / slide 13] 'Thank you.' Tim Lin, Product Lead, Resaro — timothy.lin@resaro.ai. QR code for LinkedIn. Background panel partially visible: 'one-off evaluations… STATIC, GEN[ERIC]… Better cover[age]' — final pitch is for continuous, scenario-specific eval over static benchmarks.

Full transcript (1968 words)

All right, good morning everyone. Uh, thanks for making time today. Uh, especially if you have come from the afterparty from last night. So today I'll be talking about scaling evals and maybe to motivate it, let me share with you a bit about the work that Rsaro does. So Raro is a testing and evaluation company.

We work uh primarily with uh uh companies in the mission critical use cases and spaces for example healthcare, defense, security and we help them test and evaluate the AI systems that they are developing or procuring so that they have the confidence that what they are deploying is good enough to go into production and today I'll be sharing some of the learnings we had over our past couple of years in in this journey where we see the main problems existing how how do we then overcome them as well as what are the remaining blockers to scaling use case specific testing evaluation. All right, so let's uh start with this slide over here like what do cobras sprint velocity tracking and AI benchmarks have in common. All of them shows examples of perverse incentives, right? So with the examples of cobras, it's a case that if you incentivize people to catch cobras, people will actually be breeding them instead.

And this leads to actually more cobras being out there in the wild. Um, same with sprint velocity tracking. If you're familiar, if you're a software engineer, if your manager asks you to increase the number of story points you can deliver, you see that result, but at the end of the day, it doesn't translate to any meaningful outcomes. At least from my point of view.

Um, and then you might have seen some AI benchmarks and you test the latest open source models. They typically don't they might sometimes differ from your actual user testing versus like what they show in the benchmarks and you wonder how they they manage to actually get such good results. So that leads to the concept of like what we call benchm benchmaxing. I think nowadays is is getting more popular where people actually game the benchmarks to show that they're good in certain tasks but it doesn't really translate to real world performance.

On the other side, we have vibe coding, right? Or I'll call it vibe testing. So, vibe testing is a process maybe where you have a couple of um example prompts in mind, some trick questions. How many RS are there in in strawberries? Or maybe can um can you generate an image of a pelican riding a bicycle? So, what we see for for these examples actually is that actually it's not not that bad because they are pretty useful.

They give you a sense of how the model is performing maybe in a particular scenario or aspects that you're interested in but and and they also encourage explorative explorative u exploration of the process right where you can try out different prompts and actually find what's good enough uh for your use case but I think having said that as well um how how do you actually tell whether what is a pelican riding a bicycle testing versus maybe what is a tukan riding on a tuk tuk Are we talking about just a bird on a vehicle or are we talking about maybe some other types of higher level concepts we are testing? So I think it helps to be very explicit over here like um even if we have a test case in mind what are the particular dimensions of interest that we are testing evaluating and this is where I see the middle ground between benchmarks and vibe testing. So the problem is then how can we structure the vibe testing approach such that we are able to then identifies the scenarios of interest as well as then um structure it and scale it up for like a more use case specific evaluation. So this then leads us to the concepts of operational design domains where we define that as um the sort of problems constraint space that we are testing against and this helps to govern what is the meaningful set of of test cases that we're evaluating.

Uh from there we can then define what is the expected behavior of the system what are the age cases that we should be be aware of and also what are the cases that are pro probably not within the bounds of of this system and eval and and evaluation right so that is totally out of scope and should not be uh used and consumed by the AI system so from there we are then able to derive a pipeline and workflow internally where we actually translate the odds into different test cases of interest uh link that up with data quality checks to filter out the data that might not meet our requirements and also then enhance the data quality if um if if there are gaps over there. Right? So we emphasize a lot in terms of finding the coverage gaps so that we are able to fill it and often times as we go into more mission critical use cases we find that there are there might not be enough test cases especially for age cases of interest and that's where synthetic data sets or synthetic data generation methods actually helps to bridge the the testing evaluation process. So we put a lot of emphasis in terms of how can we generate synthetic data in a way to augment the test set and I think once we have we have framed the problem as such we will see that it's actually more of a that the data is the bottleneck right we can shift the problem from from eval to how do we generate the right test cases that gives us the confidence for deployment and the challenge with that then is especially as you become in a more niche and and use case specific kind of testing is that the synthetic data generation methods um nowadays are still relatively un uh not totally predictable, right?

They don't give you necessarily the quality that you want for your for your for your generation. So, let me try to give a couple of examples over here. Um and in this example, we are trying to evaluate the uh we're trying to evaluate maybe like the performance of a VRM solution in like a better fuel scenario and setting. So, we have a we have a pen tank on the right, right?

Um and the question over here is like how do we know what's good enough for for testing in this particular use case and how can good enough be be defined uh for the generated data sets over here and more importantly I think how can we how are we able to quantify this testing evaluation such that we can then scale it up in a automated manner. So over here I have um three different augmentations examples of good augmentations right. So maybe over here a good augmentation is something that follows the prom you be generated across three different weather scenarios rain snow and fog um and we we the main subject of interest is also well preserved if there are no sight artifacts. So this seems like good generations.

And on the other hand, I'm sure if you're familiar with just generating images as well, you see that often times some of the images that generated have different types of artifacts. Uh for example, for the one on the right, you have two additional humans being being added to the image. And for the ones below as well, you see that um some of the the original tank and one of the one of the tanks has been converted to a vehicle instead. As well as the the range streaks might not be looking as realistic.

So how can we go from this vibe checking approach in terms of just eyeballing it and seeing that it looks right or it looks good or it doesn't look right into a more structured manner in terms of finding out these flaws. So for us it's about how we can then scale up the data quality checks so that we are able to automate the process of identifying such kinds of defects and flaws in a in a much more scalable manner. And I think we rely a lot on smaller deterministic models as much as possible to provide that insight. Right?

So for example, if we are talking about two generated synthetic images, we we might want to compare them in terms of whether there's a meaningful change in the that map structure of the of the main object of interest. Uh we can also then compare is there is there any new new subjects of interest that has been created of it from the original image to the generated image and all these use much smaller deterministic models that provides very good signals in terms of the data quality and as part of this pipeline we can then filter out the data sets that actually meets our uh quality criteria and use that for the testing evaluation process. We are also then able to actually scale this process up and maybe use this um enhance feedback to to actually fine-tune an evaluation model so that we can automate the screening evaluation process or subsequently the generative models as well. So at the end of the day I think what we have ended up in is to assemble a whole pipeline of different metrics that caters to use case specific areas of interest and this provides us a very reusable toolbox in terms of how we can scale up the the generation of the data sets as well as automated quality uh checks and filtering.

So we see this very similar to for example the problems in the in the coding space or in the mathematical reasoning space. You want to automate the validation and verification process as much as possible. Uh this will help reduce the human uh oversight and and overhead required in terms of evaluating this and if there's any human feedback that comes in this should help improve our automated models so that this process can then become scalable. The underlying metrics can then also be used for calibration of the of the data sets and um that we are generating because we find that for each use case uh specific scenario actually there's a very very big uh distribution of where the cut off for each metric might be.

So the calibration part is a very important statistical concern over here. Okay. So just to round things up um we we talk about scaling the evaluation of of of use case specific scenarios and and data sets and I think the main challenge over here is really in terms of how can we scale up the synthetic data generation routines as well as add the necessary quality checks to give us the confidence for for deployment. Uh with this if you want to reach out feel free to contact me on LinkedIn to talk about evals happy to talk more about like test cases eval works that we do and I'll be around for the rest of the event as well thank you and have a good day see you &gt;&gt; thank you so much Tim uh that was a great talk and up next we have Abishek from Cloudflare who heads the ETI team in India there um and he's going to talk to us about how tool calls should actually be

From Tool Calls to Tiny Computers: Code Mode, TypeScript-Typed MCP, and V8 Isolates as the Agent Sandbox

Abhishek Kankani, Cloudflare (Head of Emerging Technology & Incubation India — runs Cloudflare's small new-products team and India office)

Summary

  • Frames the problem with vanilla MCP tool calling: every additional tool turn re-sends the full conversation + tool schema + response, so a 5-8 turn release-monitoring agent bleeds tokens (context blowup) and latency (round trips). Tool calls become 'turn loops.'
  • Cloudflare's thesis: models are trained on oceans of real code but on mostly synthetic tool-call traces — so they're inherently better at writing a small program than at choosing the next tool. Code Mode flips the unit of work from 'pick next tool' to 'emit one TypeScript program.'
  • Mechanism: a library called code-mode converts the MCP tool registry (name, description, params, return type) into a TypeScript types file passed to the model as a string. The model gets ONE tool — `code_mode` — wrapping the whole existing toolkit. No need to rip out current MCP servers.
  • Code Mode unlocks real programming primitives inside a single turn: variables (interdependent calls), branching (if error rate > X then rollback else continue), loops (list workers → fetch metrics for each), and Promise.all parallelism. The release-monitor example collapses from 5-8 turns to one.
  • Explicit: Code Mode does NOT replace MCP. MCP is still the last-mile protocol that executes the actual API call. Code Mode is a better interaction layer for the model; MCP connector layer (GitHub, Cloudflare, Slack, browser, DB, Workers) sits underneath unchanged.
  • The context-window failure case: Cloudflare's own 2,500+ APIs serialized as standard tools = 1.7M tokens, overflows every model. Naive TypeScript-type dump fails the same way.
  • Solution = search + execute pattern: give the model just two tools (`search`, `execute`). The full API spec lives as a global the model never sees; the model writes code to search for the right tool then writes code to execute it. Result: 1,000 tokens for the entire Cloudflare API surface — a 99.9% compression.
  • 'Tiny computer' problem: arbitrary model-generated code is, by definition, untrusted user-generated code with RCE potential. Containers (hundreds of ms cold start, hundreds of MB, capacity planning, warm-pool tradeoffs, external isolation usually not internally hardened) are the heavy default.
  • Cloudflare's answer: V8 isolates via Dynamic Workers — zero cold start, lightweight, co-located in the same PoP as the calling application, one-isolate-per-request and throw away. Because Cloudflare owns the runtime, capability scope (which env bindings, which outbound fetch, credential injection at the proxy) is set at spin-up.
  • Dynamic Workers as 'programmable runtime boundary': only specified capabilities enter the sandbox; outbound fetch can be proxied with credential injection so secrets never reach the model-authored code. Sandbox is the moat.

Slides (13)

Slide notes

  • [t~35:50] Slide 1-2 (title): 'From Tool Calls to Tiny Computers — How agent tools became APIs, APIs became code, and code needed a secure runtime.' Abhishek Kankani, Head of ET, India, Cloudflare. [t~36:50] Slide 3-4 'Tool Calling': diagram of model → tool runtime (validate schema → execute binding → return result) with a sample monitor-checkout-api prompt; sets up the standard MCP loop. [t~38:30] Slide 5-6 'Tool calls become turn loops': stacked tool-call trace (cf.logs.list → cf.metrics.get → cf.workers.rollbackRelease → cf.metrics.get → slack.sendMessage) annotated with 'CONTEXT — every tool result is appended back into the chat' and 'ROUND TRIPS — the model waits, runs, chooses, waits again.' [t~40:10] Slide 7 'Code Mode changes the unit of work': 'Instead of choosing the next tool, the model writes a small program.' Pipeline: tools → TypeScript API → program → sandbox → logs. [t~40:50] Slide 8 'Same task, one program': TypeScript snippet — const [errors, rate] = await Promise.all([...]); if (rate > 0.05 && errors.length > 100) { rollback + slack.send }. Five turns collapse to one. [t~42:30] Slide 9 'This does not replace MCP': stack diagram — Code Mode + Dynamic Workers (direct tools / typescript / shell) sitting ABOVE an MCP Connector Layer (GitHub, Cloudflare, Slack, Browser, DB, Workers). [t~43:30] Slide 10 'The Cloudflare API does not fit in context': 2,500+ endpoints (DNS, Zero Trust, Workers, R2, Rulesets, GraphQL) = 1.7M tokens raw → Code Mode (2 tools: search + execute) → 1,000 tokens → 99.9% reduction. [t~45:50] Slide 11 'Containers are a heavy default': table — start: hundreds of ms, memory: hundreds of MB, placement: capacity planning, reuse: warm pool tradeoffs, isolation: external layer, usually not internally hardened. [t~46:30] Slide 12 'Dynamic Workers give us a programmable runtime boundary': Worker config snippet (mainModule, modules, env bindings, globalOutbound) with callouts — 'Only specified capabilities enter the sandbox' and 'Outbound fetch can be proxied with credential injection.' [t~48:30] Slide 13: stage/AIE branding close-out — talk ends, hand-off to Tis on agent harnesses.

Full transcript (2269 words)

Hey everyone, good morning. Um, so I'm Abishek. Uh, I lead the emerging tech and incubation team at Cloudflare and head the India office. So we're a small team within Cloudflare which sort of works on new products, initiatives and a lot of cool things at any given point, right? Um, I'm going to talk about tool calling today.

I think everyone here at this point has had some sort of experience with tools. Uh, can I have a quick show of hands on whoever here has interacted with MCPS and knows what tool calls are? Awesome. So, everyone knows what we're talking about. Great. Standard tool calling, right? Um you do this to give models capabilities beyond you know inference where like hey how do I have my model work with external sorry external APIs tools functionalities right um so let's take a very standard example uh I'm going to monitor an API look for errors and sort of do things based on you know certain conditions right uh the process very simple model sends you like hey I need to call this tool goes to the MCV server tool gets called you get the result and give it back to the model.

Sounds pretty simple, right? The problem is as soon as you start doing more complicated things, this becomes really costly. So let's take an actual example of a production scenario where you might have a model or an agent essentially which is doing a longunning task where it's continuously monitoring any new release that happens, right? Um wants to monitor for certain percentage of errors you know logs and then based on that try to do a roll back or make sure that hey we're good to you know release further right standard release process that has been followed for ages I think everyone here knows how that works with this setup what happens is you end up having a bunch of tool calls that happen sequentially one after the other right um so in this specific scenario I'm going to have my model, go list all my logs, you know, then fetch all my metrics, do conditional checks, uh, based on certain, you know, conditions, decide the next step.

The problem that we run into here is that every tool call that you do is going to send the entire context of the current conversation plus the tool call plus the response, right? So each turn becomes actually more context that you're sending. So one, that's bleeding money. Second, you're adding a lot of round trips, right?

Right? So you're going to add a lot of latency. Essentially there should be a better way of doing this. And I think what we're going to talk about here is basically code mode. Um so code mode is our thesis around and I mean it's not just Cloudflare right now. I think this is becoming extremely popular everywhere as of now.

But when we came up with code mode, the idea was models are inherently better at writing code, right? Um if you quickly take a look at the same example that we just discussed in a code snippet it looks something like this that hey I want to get all the errors metrics I can paralyze these tasks based on that I want to do some conditional checks and do the next steps. Uh and the reason models are better at doing this is they have been trained on a ton of code, right? Versus tool calls are most of the tool calls that models have been trained on is all synthetic data and barely any data, right?

So by natural instinct, you would feel that models are actually going to be better at writing code. And that's what we see, right? Today, if we look at the same tool call that we just described, right? Standard tool calls have a tool name, description, parameters, you know, expected output and that's basically what you feed the model.

What we do is we have a library called code mode which essentially converts this into typescript types. Uh so one now the model has the same sort of setup but as code uh it notes that hey there is a function that I can execute to do this. So in this same mapping right if you look at it we have the declaration of the function which is essentially the tool name. Um the description there is basically the tool description and then you have parameters that are passed through it right like your expected input and what is the output.

Um now what this does is it essentially gives the same sort of capability to the model but in this case instead of giving you a order of tools sequentially the model writes a single code snippet and basically what we want this to do is work with everything that is already there in your current stack. Right? So you don't need to actually go swap out your entire tools. Instead of passing like an array of tools to the model, we basically pass it a single tool called code mode.

So you can wrap up the entire existing toolkit that you have and just pass the model a single tool called code mode. What code mode will be a typescript you know library or like let's say a file of typescript types as a string which goes to the model where it's like hey I know what tools exist and I can write code against it. Um in this case you'll also see something called exeutor. We'll come to that later.

Again going back to the basics of why we write code, right? Like what we just discussed, a simple scenario that would have taken, you know, probably five, eight turns can be a single turn. And it also brings reasoning into the picture. Every time you write code, you could embed logic in it, right? You have the capability to do variables, which means you can have, you know, interdependent tool calls based on like a previous response and then figure out what to do.

You can do branching. What I just described right like if the percentage of errors is above a certain level you could like you know do a case one otherwise case two you know same kind you can do loops uh a very standard example go through my cloudflare account list all the workers and then give me metrics for all of them the way it'll happen right now without code mode is list workers fetch worker one fetch metrics fetch worker two fetch metrics right and and that will keep going on with tool calls um that's going to add context as we discussed with code mode it'll be single for loop which can go over it again and again right and you can also do things like parallelize zinc tasks that essentially don't need to wait for each other uh so yeah I want to be very clear this does not replace MCP I think this is sort of new as concept so have to be very clear here explicitly MCP is the base protocol you still need that to essentially do the final last mile API call right your server will still do that what code mode does is gives model a better way to interact and do the tool calling. Uh the actual implementation of that tool call still happens on the MCP layer right I'm going to take a different example which is like practically what we face right uh so Cloudflare as most of you okay how many of you actually know Cloudflare awesome thanks uh had me worried there so Cloudflare has over 2500 APIs right uh which is a lot given the kind of products that we have you know across a bunch of different areas vertical if we just embed this as tools today right as like standard MCP tools it does over 1.7 million tokens in context for most models we will overflow the context window with just the tool description so there's no way this works and this also comes to the same problem right if even if I convert this to TypeScript types today it will still run into the same problem right so the base idea around code mode is not that hey you blindly just replicate tools as types and do it right For most cases it will actually work and be better. But something like this you can think take a step back and think okay how can we do this better and one of the things we found here was just give it two tools search and execute right and in both these tools the model can still write code.

Now search and execute as a strategy for doing MCPS has existed for a while. People have created you know their searches like he we have a tool that gets other tools and then the tool that decides to execute it. Now you can write code here, right? So you can filter out. So think about it in a way where we tell the model, hey, we have a global variable which has the entire description which is not being passed to the model.

But the model has the capability to write code that will give it back the exact tool to be called and then also write code to execute the same thing. By doing just this, right, like a simple search execute thing, we were able to actually bring it down to thousand tokens. The entire Cloudflare API spec today can be called via model with just thousand tokens. That's a 99.9% reduction which is insanely high.

I've never seen that level of compression across any sort of things. So this is like a far more optimized way of doing things. Um yeah, exactly the example that you know we just spoke about. You now have the model going like hey I'm going to do a tool call to search the thing write a code against it put like you know an exact script which gets executed all of this discussion we had we've have we've been discussing about like model writes code and then you know it gets executed but the key question that we come to now is like where does this get executed right um so take a step back let's go a couple years ago right like preAI if I had come to you and told you here's a random user generated code run it on your you know setup none of you would want to do it u that's like a exact you know massive CV that's RC so most people would not want to do it yet today I'm standing here and telling you to do the exact opposite that give models absolutely untrusted source and you know let them write code which could be anything which you never get access to and run it so where do we run it and that's what we come to the tiny computer part, right?

You essentially need a very efficient, secure sandbox environment, right? And there are a couple of ways of doing it. I mean, you could do containers. Containers have existed for ages, right? And the problem with containers typically is that you have a massive cold start time, right? Um, you have to provision a lot of it properly.

You have, you know, memory, you have compute, all of this needs to be planned very well. Um and then you have you know basically it's an external layer which means you sort of have a lot of challenges of handing over things properly and securely. The other approach here is V8 isolates. Um quick show fans.

How many of you know about Cloudflare workers? Awesome. So workers are our own runtime layer which is based on V8 isolates. So we took V8 isolates for it and created serverless around it. Um there's a lot of good detailed blogs that you can read about it. But essentially what this does is it eliminates all the standard problems that we just discussed, right?

Like you actually have zero cold start time. It's absolutely lightweight, right? And the way workers work is your dynamic workers, which is essentially what we're talking about when we say V8 isolates will spin up in the exact same location, exact same, you know, setup where your main application is running on a worker, right? And again, you could do each isolate as a one request and throwing it away.

Right? So again, workers give us like a great boundary. Make sure that it's scoped just to execute that code. Does not have chances of leaking secrets, you know, getting malicious code into your actual main setup. And you can decide while initiating a worker what's the kind of scope and capabilities that you want to pass it down.

Right? Um again just like a quick way you know why isolates work better and essentially because we own the runtime it just makes it way easier to actually do sorts of you know information exchange making sure it's done in a secure manner and you again don't have crazy insane you know wait times to spin off things. Um yeah that's pretty much it. Thank you so much Thanks for that, Abishek.

And up next, we have Tis, who's going to talk to us and do a deep dive about agent harnesses.

Harnesses in AI — A First-Principles Deep Dive: Six Primitives, Loop Control, and Trust Over Black-Box Models

Tejas Kumar, IBM (Berlin-based AI engineer + developer advocate at IBM; well-known dev-rel figure in the JS/TS and AI-engineering community, previously DataStax, frequent conference speaker, host of devtools.fm — flew in from Romania for this talk)

Summary

  • Frames the entire talk around one thesis: 'harnesses give you a sense of control over black-box vendor models you cannot otherwise trust' — opens by asking the room who runs local inference (one hand), then makes the point that everyone else is shipping prompts into a foreign body that can silently swap Opus for Sonnet during an incident and you'd never know. The harness is the answer to that trust gap.
  • Defines an agent harness as 'everything around your agent — the tool chain, the environment in which your agent executes that gives it the best chance of success and reliability.' Distinguishes from older ML eval-harnesses (glorified unit tests for models) — agent harnesses are the runtime envelope.
  • Names the six canonical harness primitives every coding agent (Claude Code, Codex) already has: (1) tool registry, (2) language model, (3) context-management primitives (compact / clear — '/compact' in Claude Code), (4) guardrails (quota cutoffs, max iterations, max messages), (5) agent loop (the while-true that decides 'am I done or one more pass?'), (6) verify step (npm run verify or equivalent before declaring done).
  • Core operator claim: 'a good harness lets you do more with less — keep the prompt frozen, use a bad cheap old model (GPT-3.5 Turbo), and you still win 70% of the battle.' Inverts the prevailing 'prompt harder / fine-tune / swap to a bigger model' instinct — reliability is an architecture problem, not a prompt problem.
  • Live-built a TypeScript harness on stage that wraps a Playwright browser-use agent told to upvote the top Hacker News story. Baseline run with no harness: GPT-3.5 hits the login wall, crashes, then lies to the user ('I have upvoted the highest-ranked story') — the lie is the failure mode the harness has to catch.
  • Walks the harness build in three stages: (1) add guardrails — maxIterations + maxMessages + a naive trimContext that keeps system + user + last two messages; (2) add a verify step that inspects the message trace post-hoc — looks for browser_click on selector matching 'up_*' to confirm success, looks for hitting login URL without a harness-auto-login tool call to confirm failure; (3) inject a loginHandler owned by the harness (not the agent) that fills username/password directly via Playwright when the agent lands on the login page, then pushes a synthetic 'harness auto-login completed — you are authenticated' message into the agent's message list so the loop continues. Final run succeeds on GPT-3.5 Turbo with an unchanged prompt.
  • Key architectural distinction surfaced repeatedly: 'the browser session is owned by the harness, not by the agent.' Tools the agent can call are a strict subset of what the harness can do. The harness can intervene out-of-band — log in, retry, fail loudly — and then ghostwrite a tool-call result back into the agent's message stream. This is the loop-control primitive that makes black-box models usable in production.
  • Plugs IBM's open-source enterprise reference harness: OpenRAG — pitched as 'an open-source enterprise-ready RAG harness for companies with messy confidential data scattered across Teams, calls, notes.' Not a sales push ('I'm not here to sell that'), positioned as a reference implementation for the audience to study.
  • Closes by casting forward: 'we hardcoded this harness — the next step is dynamic harnesses where agents build their own harnesses, and I think that's the path to AGI.' Plants a flag that the harness layer itself becomes agentic.
  • Pedagogical signal: opens by asking who can explain harnesses (3 hands), closes by re-asking (most of the room) — and visibly checks the room. Talk is built as a teaching artifact, not a vendor pitch.

Slides (12)

Slide notes

  • [t=~48:50] Speaker-cam open: Tejas (yellow-hand emoji on screen reference) at podium, blue IBM-branded shirt, taking the room temperature. SG skyline line-art backdrop. No deck content yet — opening banter ('pronounced like contagious, don't worry I'm not').
  • [t=~50:30] Screen-share split begins: macOS desktop showing a terminal window inside what appears to be cursor/VS Code, plus a 'demo' slide visible in the background. Title bar: 'harness'. Center title slide says simply 'demo' on dark purple background — this is Tejas's deliberate aesthetic: minimal slides, maximum live-code.
  • [t=~52:10] Browser opens via Playwright to a blank about:blank tab — first run of the unharnessed agent. Audience watches the agent fail in real time.
  • [t=~53:40] Terminal output visible: 'tsx agent/?-index.ts | Model: openai/gpt-3.5-turbo-0613 | Task: upvote on Hacker News' followed by [iter 1–4] showing calling_model → tool_calls (browser_navigate to news.ycombinator.com, browser_get_stories, browser_click selector "a[id=\"up_48164287\"]") and then a final 'Answer: I have upvoted the highest-ranked story on Hacker News that was not already upvoted: Zerostack — A Unix-inspired coding agent written in pure Rust.' Stopped by: model. Iterations: 4. This is the LIE — the agent never actually upvoted, it crashed on login and confabulated success.
  • [t=~55:10] Browser screenshot showing a blank page (the failure state post-login-redirect crash) — visual proof of the gap between the model's claimed output and reality.
  • [t=~56:30] Cursor/VS Code agent palette visible: 'New Agent, Show Terminal, Search Files, Open Recents, Run Command, Add Repository' — Tejas walking the audience through his project structure inside cursor.
  • [t=~58:00] Same terminal output as before but now centered/zoomed — Tejas re-emphasizing the lie ('this is an absolute lie') before pivoting to the harness build.
  • [t=~62:30] Git log / diff view in what looks like a Git client (Fork or similar): shows commits 'Add guardrails and context compaction,' 'Add guardrails,' 'Fix bug,' authored by Tejas Kumar over April 2026. Code panel shows verify-step code: 'if (unrecoveredLoginRedirect) { return { ... reason: "Hit login screen instead of completing the upvote ${extractURL(...)}" ... fatal: true } } return { passed: false, reason: "No successful upvote click found in trace" }'. This is the verify-step harness primitive in literal source code.
  • [t=~64:50] Terminal showing the FAIL run with the harness verify step engaged: --- Agent trace --- [iter 1] browser_navigate ... [iter 2] browser_get_stories rank:1 id:48164287 title:'Zerostack — A Unix-inspired coding agent written in pure Rust' ... [iter 3] browser_click selector 'a[id=\"up_48164287\"]' Clicked on element ... now at https://news.ycombinator.com/vote?id=48164287&how=up&goto=news ... [iter 4] answered. --- Result --- Hit login screen after completing the upvote (https://news.ycombinator.com/vote?id=48164287&how=up&goto=news). Stopped by: ... Attempts: 1 ... Verify: FAIL. The harness now tells the truth.
  • [t=~66:30] Code view: agent-loop source showing the loginHandler integration: 'if (loginHandler) { const loginEvent = await loginHandler(); if (loginEvent) { toolEvents.push(loginEvent); messages.push({ role: "user", content: "Authentication completed by harness. You are now logged in. Navigate back to https://news.ycombinator.com" }); } }'. This is the ghost-written tool-result-into-agent-message-list pattern — the load-bearing primitive of the entire talk.
  • [t=~67:40] Final SUCCESS terminal run: Model: openai/gpt-3.5-turbo-0613, Task: upvote on Hacker News, [iter 1] calling_model tool_calls browser_navigate, [iter 2] calling_model tool_calls browser_get_stories, [iter 3] calling_model tool_calls browser_click selector 'a[id=\"up_48164287\"]' done. [Harness] Login redirect detected, handling automatically... [iter 4] calling_model tool_calls browser_navigate https://news.ycombinator.com done. [iter 5] calling_model tool_calls browser_get_stories done. [iter 6] calling_model tool_calls browser_click selector 'a[id=\"up_48164287\"]' done. [Harness] Upvote successful for story ID 48164287 — forcing completion. done. Same bad model, same frozen prompt, harness owns the login flow — and it works.
  • [t=~68:30] Keynote/deck thumbnail view in the corner showing the 'demo' slide as the active slide — Tejas wrapping up, returning to the deck. Visible thumbnails of prior slides in the side panel: title slide, six-primitives slide, why-harnesses slide. Pure minimalist deck design — the deck is intentionally background to the live code.

Full transcript (4015 words)

Is this on? Hello everybody. Good morning. Wow, all of you are asleep. Can we try this again? Hello everybody. That's better. Nice. Look, look, it's it's it's a it's a dialogue, not a monologue, you know, like I I'm here to talk to you, not at you. Um, good morning. He's just setting up my slides right here. Uh, but this is going to be a fun a fun conversation, I think.

Is everything good? No. Oh, he's it's Give it up for your tech team, everybody. That's so cool. They make they make this event possible. I love it. It's uh we would be so lost without them. Excuse me one second. Oh my god, he's spoiling my slides. That's It's all good. Let's go here. There we go. That's me. Okay.

Hello. I'm the yellow hand. See, it's way Hi, I'm Tis. Hello, everybody. It's good to It's good to meet you again. Um, as you may have seen, my name is Tis. Uh, that's pronounced like contagious. Don't worry, I'm not. Uh, they wouldn't have let me in the country otherwise. Uh, I I I I flew 16 hours to be here from Romania where I was yesterday.

Uh, and I'm based in Berlin. Uh, and and over the years, I've been I've had the privilege of working at a number of, uh, various tech companies with with really great teams and learning from the best. In fact, I'm not really here to show you uh opinions, but just facts of lessons I've learned, not from myself, but from uh very very smart people, people who are far smarter than me. Today, I'm an AI engineer at IBM uh where we build um a lot of things, foundation models and harnesses and things for our customers, but also for developers.

Uh I help the developer community around IBM and otherwise. I I teach people about harnesses and AI and things um here. And today, that's what we're here to talk about. We're here to talk about AI harnesses from first principles. Um, just as a quick show of hands, how many of you feel confident that you know and can explain AI harnesses, agent harnesses?

Okay, there's like three people. Um, good. I'll do the same thing at the end of the talk and I expect uh more hands. Okay, that's the goal here. That's why I'm here. I'm here to teach you about what harness is and how they work and why you need them. Uh, because it's a term that's kind of everywhere. And the problem with terms like this when they're in the zeitgeist is they can get lost in translation.

Okay? And sometimes we we don't feel confident enough to reason about them uh strongly. And so hopefully this changes. I'd love to start just by talking about why we even need harnesses. Uh because I think a great leadership principle in general is to to start with why. So why why do we need a harness? And the answer really is is is the same answer to why we need a harness for anything that we use harnesses for.

Uh think about climbing a mountain, right? Like you harness yourself the mountain so that you can go up and down the mountain reliably, you know, meaning you don't fall off and die. Okay. Um similarly like if you have a dog or a pet right you usually put your dog on a leash you give it a harness to you so it doesn't run away and get lost but but it stays with you reliably okay so the the whole point of harnesses for agents or humans or pets or whatever is reliability and the reason for that is because we when we do AI work we we often just trust black boxes have you ever thought of this like unless you're doing inference on premise which any of who's doing inference on print like locally.

Yeah. One him uh and maybe some people one or two people here. If you're the the vast majority of us, what you do is you send a prompt to some vendor with a black box and you say, "Hey, do this for me." And then you hope for the best, right? Um you you send a prompt to say Claude 4.7 Opus. Um but if they have some type of incident, they may serve you sonnet and you have no idea of knowing.

So you just okay, I guess it's kind of not feeling. Opus doesn't feel the same today. Has anyone had this sensation? Right? That's because you trust some foreign body and and this is why we need harnesses. So what harnesses do is they give you more of a sense of control uh to make your AI apps and agents more reliable.

Okay, is that clear? So that's why we do harness engineering. What is a harness? Uh I already talked about it. It's this thing. Um but assume that's an agent, not not a human. And and that's what a harness is. In fact, agent harnesses in particular are a newer sort of evolution of the term harness. In machine learning engineering, we had eval harnesses.

These are basically glorified unit tests for models. Okay. Um but agent harnesses are slightly different. If I ask you to define an agent harness, um this is what I expect to hear. The the answer of what an agent harness is is it is everything around your agent, the tool chain, everything around it, the environment in which your agent executes that gives it the best chance of success and reliability.

Everything around the agent. So if we think about some typical agent harnesses in the wild, they all have at least these six components. Number one, they've got um a tool registry. They've got a set of tools. If we think about a harness like cloud code or codecs, they have tools. Read and write from the file system.

Search the web, right? Number two, there's a language model. Uh almost every harness will have a language model somewhere like cloud code has the the cloud models. There's context management primitives for compacting context or clearing context. Right? If any of you use cloud code, you're like slash compact.

Um there's guardrails. Uh for example, I think the most common guardrail is you've used up your quota. I'm not going to talk to you anymore until you top up, right? That's that's a guardrail. There is um there's an agent loop in the picture uh where this is where the agent finishes a task and then says okay I'm going to am I actually finished or should I do one more pass and finally there's a verify step.

So if you are using an agent harness like like let's say cloud code I love cloud code right at the end of it it will say okay I've done the task now let me run npm runverify or whatever it is to finish out this loop. So almost every agent harness certainly every coding harness coding agent harness has these components if not more. So these are kind of our building blocks at this point. I'm I'm tired of the sound of my own voice and so I'll just do a demo instead of talk to you.

And so what we're going to do is we're going to actually build a harness here uh in whatever time we have left live on stage. Um it's a min it's a poor man's harness but it's just to kind of give you an idea of what a harness is so you can go build your own. Okay, that's my job here. Um, what we're going to do is we're going to build a a browser use agent, something that spins up Chromium and uses it to do a job.

Uh, as you can see, it will be unreliable at the beginning. That's kind of the point, but we'll build a harness around it to make it safe. I'll say this, harnesses allow you to do more with less. You could choose a really bad model, a really old GPD 3.5 mini or 3.5 Turbo, like old. That's like two years ago.

It's crazy. I'm joking. It's a very old model. And it's cheap. It's basically free. So you can use an unreliable model and you can use a prompt that is kind of bad because a harness gives you the reliability. Often times when we don't get the results we want, we think, oh, just prompt it harder. Just fine-tune the system prompt, change the language, add a skill.

With a harness, you don't need any of this. You can keep the prompt frozen. It can be a bad prompt. You can use an old cheap model. If your harness is good, you win like 70% of the battle. Okay, so let's do that. I I'll build a harness. We'll build one together here on stage and then uh we'll wrap up. So this is what I want.

I' I'm I'm running I'm just going to run my my agent right here. Uh I've written it in Typescript. Anyone use TypeScript, JavaScript, something? Okay, you'll kind of get it. Uh we'll we'll do npm run agent. And what you'll see is um it's going to open a browser. This is I'm not touching. And it goes to hacker news and it tries to upvote an article, but it gets the login screen and crashes.

The job of this agent is to go upvote the first article on hacker news that is not yet upvoted. Okay, is that clear? Yeah. So that's the job. But here's what it does. I'll run it again. Look. So we open a browser. Um goes to hackernews and we're using GPD3.2. We're using Oh, goes to hacker news. Hits the login form.

But then it answers me. I have upvoted the highest rank. This is a lie. This is an absolute lie. What actually happens is it goes to the tries to click upvote, hits the login form, and then crashes. Right? So this is a total lie. How can we fix it? We'll fix it with the harness. To start with, let's look at the actual code of what's happening.

So this is uh cursor. I love cursor. And this is our project. So this is what we have so far. We have the model. Uh we're using a very sorry, I should change this. We're using a we're using a very old model. Uh cheap, basically free. And this is our prompt. Upvote a story on hackernews. These are not going to change, but our harness will change.

I want you to know that. I want you to be very clear on that. So here's what happens. We start a new browser session, and that's code that I wrote. This is using playright not playright MCP but we're just programmatically compos uh controlling the browser with a class. Okay. And then when we have the session we create tools and this does exactly what you think in code.

We just return a bunch of tool definitions just like this. It's just a bunch of JSON objects with descriptions and so on. We also create our context. You think this is complex? It's really not. It's just a message envelope with a system prompt and the user's prompt. And the user's prompt is is the thing that we already wrote.

It's this thing here. So it's just an array with two objects. Okay. And then we finally run the agent loop. Now what is the agent loop? Well, it's while true, keep doing stuff, keep pushing messages until you reach the stop condition. So this is the LLM saying I've finished. And in that case, we return the answer to the user.

But throughout our entire agent loop, we're just pushing different events. I called this tool. I sent this message. I got this prompt. We're just pushing these into a list. That's all we're doing. If we call tools, then we push each tools result into our messages collection. Does this make sense? We just keep track of every message.

Okay, so that's it. And as we as our agent exists today, it doesn't work. It hits the login screen and crashes. So what we need to do is build a harness. We need to build guard rails first. Then we need to actually make it tell the truth. Hey, I crashed at the login screen instead of I've successfully done this.

And then we need to actually fix it. That's the journey we're going on. Okay. So step one, we add some guardrails because right now it can execute infinitely and bankrupt me. So how are we going to do that? Well, let's in investigate this git diff. So we right now just call run loop and we pass a model and messages, but we're going to change this to include some guardrails.

We'll call them default guardrails. In fact, what are our default guardrails? Well, let's go to the editor and check it out. So we have this file guardrails.ts. And these are our guardrails. We have two max iterations. How many me how many times can you try and max messages? How many messages until we compact your context?

And then we have a little helper to combine them. Okay, but how do we actually use this? Well, if we go to our agent loop, you can see that we include the guardrail here and we check we call the guardrail and if it's not okay, we just end. We say this is why we stopped and we trim context here on every message.

So while true at each iteration we call um trim context. What does trim context do? It's this is actually really bad. Don't do this in practice. But what we're doing is we keep the system prompt and the user prompt and the most recent two messages after that. There's more intelligent ways to do this. That's not the purpose of this talk.

The purpose of this talk is to show you a guardrail as we build a harness. So now we have our agent our agent and we have a few guardrails. You know what that's called? It's called a harness. So, what we're going to do is we're going to just rename things to keep them a bit more truthy. So, what I'll do is I'll go over here and I'll say, look, we just have index, but we're going to delete all our code and just abstract it under a function called run harness.

And we're going to take all this all the stuff in red and we're going to move it into a new file called harness.ts. Okay. And what is harness.ts? Well, let's open it. Harness.ts is everything. You may recognize this code from the beginning. It's everything from our index.ts. Ts. We just put it inside a function called harness.ts.

Does this make sense? So, we just take it and we call it uh run harness and print harness result just console logs things. It's just for logging. This is not really that useful. So, we've just moved code at this point. But now that we have run harness, our next step is to okay, now that we have a harness and we have a browser session that is not controlled by the agent, but by the harness, we can hook into this browser session when we need to to detect did you succeed or did you fail.

Okay, that's what we're going to do now. So now that we have this harness file, we'll come over here and this is uh this is what we're going to change. So we are just changing our run harness function call a little bit to add a third argument which is some options a verify step and max attempts. Okay, verify successful upvote.

If we go to our harness, this is getting a little bit interesting. Now these are just types but here we have max attempts. We say you run the harness no more than three times. And so for each attempt, we do a little bit of a verification step. If it failed um or if it reached max attempts, we just return the latest result.

But we have this function in our harness now called verify successful upvote. What does it do? Remember in our agent loop, we keep pushing events to a big list, right? So what our harness does is it checks the list. If you have a browser click and if you clicked on an element with up something something then it means you clicked on the up arrow.

That's what our harness is validating. So if that's true then return true. I upvote click confirmed. But if you see a tool named harness auto login and the result is harness failed to handle login then we say no no you failed the login and we return a false result. Does this make sense so far? It's just code.

Okay. Finally, we also have this variable called unreovered login redirect which we check all the tool calls. Ah, okay, I went to the browser here and and this was the result. We check all the tool calls and if we see a tool where the name is not harness auto login but if we're on the login URL, what does that mean?

That means we went to the login page but the auto login didn't work. Then we fail and we say return past false login screen instead of completing the upload. Finally, we need a success case also. Um, but that's coming. So, we just added a few like if this then say we failed, okay, to our harness. This is our harness.

This is not our agent loop. So, now let's run that and see what happens. So, I will run this here. Um, and so now it's opening the browser. We're going to hacker News and uh we go to the login page. It crashes, but what's the output? We we get it to actually tell the truth. We hit the login screen instead of completing the upvote and and it says fail.

This is what should have happened. Now let's quick checkpoint. We did not change the prompt. We did not prompt it harder and we're still using an old model. Okay. But the harness is now giving us some truth. Let's fix this. We're about to finish. Let's fix this with actually now that we know that it's getting stuck at login.

We can fix this at the harness level. Okay. So let's do that and then we'll wrap up. So what is what's the final form? We add a file. We call it login handler. And what does this function actually do? It's just a function. But here's what it does. This is the line that's important. Um if we're not on the login page, don't do anything.

So this function is a no. Unless we're on the login page. If we are on the login page, we fill a username and password into the input because the browser session is owned by the harness. It's not owned by the agent. Does this make sense? So it's not tool calls driving the browser. It's my harness that I wrote.

Okay. So I inject this username and password and then I return a message. The tool name is harness auto login. And the result is the harness automatically logged in. And this is basically to the agent. You are now authenticated and back at home. So my harness injects this into the chain of messages. Does this make sense?

So I'm logging in now at the harness layer. Okay. But this is just a function. Where do I use it? Um I use it actually in the harness. So I create the login handler and in create tools I just add a few guard rails here but I'm taking the login handler and giving it to my agent loop run loop and in the agent loop this is where we we land the plane.

I send the login handler and this is the code that makes it work. So inside the agent loop I say if I have a login handler then I just await its response because again if I'm not on the login page this is going to return nothing. If I am on the login page and if I receive a login event, then inside my agent loop, I push it to the list of messages. Does this make sense?

And so if the harness successfully logs in, it adds a message. I've logged in and the agent reads this and then continues. Does that make sense? This is the whole point of a harness. So let's run this and then we'll wrap up. So um we should now be running the latest version. And so what I'll do is npm run agent and it should work by the harness.

So we log in to HackerNoose. Um it it typed the username and password and indeed you can see that it lo it it did that was way too fast. It successfully upvoted this upvote. Click confirmed by logging in with the harness very rapidly. Does this make sense? We did not prompt it harder and we used GPT3.5 Turbo but we got more control with the harness.

Uh let's uh wrap up here. What does this mean? This means you can do a lot more with a lot less with a harness. And again, the harness is the is the environment around your agent that increases its chance for success and reliability. What does this look like in practice? Um, I work at IBM and we work on harnesses daily.

Uh, at IBM, we create an enterprise ready open-source rag harness. Uh, because as you may know, enterprise data is big and it's everywhere. There's all these teams calls and like notes and you don't know what's confidential and what's not and it's very risky and and so we we create an open- source enterprise harness for um large companies. It's called open rag and again it's open source.

That's the important part. Uh and if you're interested in it, you you're more than welcome to scan that. I'm not here to sell that. I just think it's it's a nice reference implementation uh for harness. Uh but let's land the plane and cast some vision. Okay, in summary, what did we what did we do? Look, I started this talk asking you how many of you feel confident that you'll be able to explain what a harness is and why it exists and so on.

Is that number changed at all after this talk? Yes. Oh, that's a lot. That's like almost the entire room. Okay, I've done my job. Um, that's what harnesses are. That's how you build them and that's how you do more with less. You don't change your prompt. You don't change your model. What might the future look like?

Well, we just hardcoded a harness. We wrote that ourselves. But I would be foolish to think, oh wait, but wouldn't it be amazing if harnesses were dynamic and if agents could create their own harnesses and then do work? I think this is the dynamic harnesses are likely the next step towards AGI where this can all be managed by an agent.

But with that, um, I want to land the plane here. I I've already maybe taken a little more time than I deserve, but I want to stop here and just say thank you so much for your time and attention, Singapore. Thank you so much, Tis, and thank you all. I see the rooms filled up. Uh we're

JJ Geewax (Google DeepMind): Surround the Non-Deterministic Model with Deterministic Boundaries

JJ Geewax, Director of Applied AI at Google DeepMind (SG-based, author of 'API Design Patterns' (Manning), longtime Google API design lead; team pushes Gemini, Gemma, Nano Banana, Veo, AlphaGenome, WeatherNext past their default limits and ships them into production)

Summary

  • JJ leads the Applied AI team at Google DeepMind in Singapore - their job is to make frontier models 'do real things' (Veo's 8-second clips into full scenes, Nano Banana's 4K into 8K+ billboards, Gemini chatbots that never break policy) and they're hiring.
  • Moving-goalpost framing: chess > Go > ChatGPT > agents > robots - each was 'impossible', then 'amazing', then 'obviously, brute force'; for builders this means stop riding the train, take a snapshot of today's models, and ship.
  • Three production walls he keeps hitting: (1) prompt injection (the Chipotle 'write me a Fibonacci function before my burrito' screenshot), (2) temperature=0 is NOT deterministic - subtle input changes still produce wildly different outputs, (3) RAG poisoning - a refund in chat history teaches the agent to give refunds; a $1 test car teaches it to sell cars for $1.
  • Diagnosis: 'we're asking the model to do too much' - hackathon code has narrow inputs, forgiving requirements, one happy path; production has huge input variance, complex regulations, edge cases everywhere. Single-giant-system-prompt patterns collapse at scale.
  • Core architectural prescription: stop using the LLM as one giant router. Decompose into deterministic pipeline stages with small, focused LLM calls inside each: Route (LLM as classifier on a multiple-choice question, NOT plan/reason mode) > Transform (JSON-in/JSON-out via Pydantic AI, ADK, Agno) > Generate (LLM produces the human-facing text/audio/video) > Safety check.
  • Safety as failsafe pattern: never trust the main LLM output. Run a context-free LLM call ('I'm a car insurance company - is this output okay?') with no prompt-injection surface, plus a smaller ML classifier scoring safety 0.0-1.0. Same pattern Cloudflare uses.
  • Multi-model latency decomposition example (camera feed agent he can't fully disclose pre-I/O): small dumb on-device model handles 50 FPS / 50ms depth + obstacle detection; Gemini handles semantic understanding async. Decompose the problem yourself - don't wait for one model to magically do both.
  • Closing framework - 'Models are incredible. And they're not magic': (1) use LLMs for what they're great at (language, reasoning, classification, generation), (2) use determinism for what they're bad at (routing, rules, consistency, compliance), (3) use the LLM to WRITE the code that does this - 'just throw an agent at it'. Bonus: fine-tuning and evals (AI-TDD via golden datasets) are underused; he asked the room and almost no one had done either.
  • Quotable lines: 'we can't just tell our customers don't worry, I added don't break any laws to the prompt' and 'our agents in a lot of ways are like really really silly interns that just got hired and they're trying to do a good job, but they don't really know what they're supposed to be doing.'

Slides (12)

Slide notes

  • [t=1:32:42 / slide_01] Title slide: 'What my team does - We push the technical boundaries of DeepMind's models.' Three columns: Text & Language (Gemini, Gemma) | Image & Video (Nano Banana, Veo) | Science & Specialised (AlphaGenome, WeatherNext). Speaker bar: JJ Geewax, Director of Applied AI, Google DeepMind - 'From Models to Production: Shipping Applied AI at Google DeepMind Scale.' [t=1:34:30 / slide_02] 'The obvious next question... What does "pushing the boundaries" actually mean?' Subline: 'We help businesses use today's models to do real things.' [t=1:36:00 / slide_03] 'The models are amazing! But also never good enough...' Moving-goalpost timeline: Chess ('That's just brute force') > Go ('Obviously') > ChatGPT ('Still not intelligence') > Agents ('It doesn't follow instructions...') > Robots in factories ('It doesn't do my laundry'). [t=1:38:20 / slide_04] 'Moving goalposts aren't bad. Just misleading.' Two columns: 'It keeps us pushing forward / We always aim for the next challenge.' vs 'It makes us forget how far we've come / We lose sight of what's possible today.' Bottom banner: 'Sometimes you need to stop and build something real with a snapshot of what exists now.' [t=1:40:00 / slide_05] 'Flavors of AI - Different ways of using AI have different requirements.' Three columns: Code generation (Claude Code, Antigravity, GPT Codex) | Agents (OpenClaw, Hermes Agent, Pi) | In-app (Vertex, OpenRouter, Direct API calls). JJ states he focuses on the third bucket - in-app calls that real end-users (his grandma) interact with. [t=1:42:00 / slide_06] 'Our job - Help businesses get past today's limitations and build real things.' Three concrete examples: Veo generates seconds of video > How do you make a full scene? | Nano Banana makes 4K images > How do you make an 8K+ billboard? | Gemini powers a chatbot > How do you guarantee it never breaks policy? [t=1:44:30 / slide_07] 'What we think the problem is - We're asking the model to do too much.' Subline: 'Then getting mad when it doesn't behave consistently.' [t=1:46:30 / slide_08] 'It works in the hackathon. That's the problem.' Two columns. Hackathon: Narrow inputs / Forgiving requirements / One happy path / 'It works!' Production: Huge input variance / Complex regulations / Edge cases everywhere / 'I told the model to do the right thing and it hallucinated.' [t=1:48:30 / slide_09] 'Our strategy - Stop using the LLM as one giant problem-solver.' Subline (the canonical line): 'Use it in smaller, focused chunks, surrounded by determinism.' [t=1:50:00 / slide_10 - Step 3 in the route/transform/generate/safety series, prior steps narrated but not in extracted slides] 'Step 3: Safety as a failsafe - Safety checks before showing output to a user.' Two columns. Context-free LLM call: 'A fresh model call with no prior context. This simply asks: does this output meet safety criteria?' ML classifier: 'A simpler, faster model. Not the same model. Not the same context. This model asks: How safe is this text? 0.0-1.0.' [t=1:52:30 / slide_11] 'Models are incredible. And they're not magic.' Three-column closing framework: (1) Use LLMs for what they're great at - Language, Reasoning, Classification, Generation | (2) Use determinism for what they're bad at - Routing, Rules, Consistency, Compliance | (3) Use the LLM to write the code that does this - 'Seriously. Just throw an agent at it.' Footer quote: '"Give financial advice to this customer, DO NOT BREAK ANY LAWS" doesn't work... yet.' [t=1:53:30 / slide_12] Post-talk handoff frame - next speaker Jeff Huntley's intro slide (shipping-container Ralph-loop visual, 'software development now costs less than minimum wage'); not part of JJ's deck.

Full transcript (4470 words)

Cool. Hi. Uh, I'm JJ. Uh, I'm a, uh, engineering director at at DeepMind. Um, and so I lead the applied AI team there. Um, I'm based here in Singapore. Um, I am hiring, so if people are curious about, um, working there, um, definitely reach out. Um, so I'm going to talk a little bit today about moving from uh, hackathon kind of things to production, which is sort of what my team does.

Um, and dealing with models at scale. Um, so before we get into that, I kind of wanted to share a little bit about what my team does. And I see at least one of them here. Hopefully the others are as well. Um, so what we try to do is we push the technical boundaries of the deep mind models. Um this means the ones that I think most of us are familiar with um Gemini and and Gemma which is our openw weight um text model.

Uh but it also includes the nanobano and vio uh video and image models as well as the more sciency things. So that's the alpha genome and uh weather next. Weather next predicts weather and hurricanes and large scale um storms and things like that. So our job is to try to make the models do what they weren't necessarily designed to do or blow past the limits that we might have set on them.

So um a good example with VO is it generates 8 seconds worth of video, right? So you give it a prompt and you get 8 seconds of video out. Um what happens if you wanted to generate like a whole scene from a movie, like five minutes worth? Uh how do you do that? Our team tries to do those sorts of things. Or with Nana Banana.

Let's say that you have a movie and you want to outpaint the whole thing um to make it like widescreen, for example. Um that's kind of an an example of what we might do. Uh these things sound kind of easy because they're just more of the same, but it's actually a much more challenging problem and uh we have to come up with clever ways of getting around it. Um so uh what we ultimately try to do here is make the models do real things.

So, it's nice to have 8 seconds of video, but that's kind of a fun hackathon project. Um, it's not really a real thing. You can't sell that to a movie studio. Um, I can't be like, "Look, here's your 8 seconds of of movie." You need to kind of do more than that. It's also making the model sort of adhere to what your guidelines might be.

Um, describing a movie in text is actually really challenging to get it right and then you end up with this giant prompt and it's very fragile and it breaks. Figuring out how to anchor it off of key frames and understand animation and you know behave the way an animator or a director wants it to is actually a really surprisingly challenging problem. Um so we try to do all of that. Um now I I want to pause for a second because I was just saying how like oh the models aren't good enough.

They only generate 8 seconds of video. I I kind of want to pause and just I need to say this AI stuff is amazing. Like it is completely crazy. I I I don't know if you guys remember, but like a few years ago, like chat GBT didn't exist and our lives were totally different. Um, and there seems to be this world of like the models are incredible and they're still at the same time like not enough.

They don't do real things, you know, my whole job. Um, but like there's always been this moving goalpost thing like with chess, right? I don't know if you guys remember when like the whole Deep Blue thing happened. I was a kid so I wasn't really paying attention, but we like computers beat somebody at chess and then everyone was like, "Oh, that's amazing." also, oh, it's just chess.

Um, and then go was was 10 years ago. Uh, Demis just went to Korea to celebrate 10 years of like solving Go. And everyone was like, oh, that'll never happen. I remember I was working at Google at the time and everyone was like, is this going to work? Like, is it going to win? I I don't know. And then it then it did most of the way.

And now everyone's like, oh, it's just go like gh. Um, and then chat GBT came around and it was incredible. I remember showing my wife that she could just ask for, you know, things and it would answer her and like turn it into a table and all kinds of crazy stuff. Like incredible. And now we're like, ah, chat GBT old news.

It's just a chatbot. And and now we're at this sort of weird phase now where like we have agents and they do stuff like they call and make restaurant reservations using like 11 Labs and Open Claw and they're accidentally deleting all our emails and you know, crazy things like this. And it's like we're still mad that the agent doesn't follow our instructions, right? Like just how spoiled we've gotten.

Um, does anybody remember when we got Wi-Fi on airplanes? Like, and that was incredible. And now it's like, uh, it doesn't have Wi-Fi. Like, uh, and now now there's robots and robots are like doing factory jobs and we're like, gh, but it won't even do my laundry. And it's just And I actually saw a video of a robot uh making a bed and taking out the trash.

And so maybe soon this bullet point will go away. So, I need to say like my job is to make models do real things, but like let's let's be honest with ourselves that models are incredible. Like shockingly incredible. So, I would argue that this this idea of moving goalposts has been around for a long time.

And it's not necessarily a bad thing, but it is a little misleading because, you know, it keeps pushing us forward, but at the same time, we kind of forget where we've gotten to um and how amazing all of this is. Um, and so this brings me to an important point, which is everything's been going incredibly fast. Just so fast, right? Three years ago, no chat GBT.

Now we have three different very popular agent frameworks and crazy video generators and it's it's incredible. We can't tell what's real on the internet anymore. Um, but for people like me, we in and businesses, you need to take like a snapshot of where you are and hit the pause button and you're basically stuck in time so that you can build something real. You can't just keep riding the train.

Like you have to get off and build something. And so that's sort of what I what I'm doing here. Um I also want to say uh there's a bunch of different categories of using AI and I use it in a couple different ways. I'm going to focus mostly on the third one here, this inapp thing, right? So we all use how many people are using some kind of AI codegen?

I hope a lot of hands go up. Okay. Um and how many people have like an agent that they're using and doing crazy things? Awesome. Um, this third one is the idea that inside your app, we're going to make API calls that your users actually interact with. So, the idea is this isn't something you as a developer interact with.

It's something that a, you know, your grandma who might be talking to a chatbot and not realizing they're talking to a chatbot is going to be is going to be dealing with it. And so, my role is primarily with that third category. Um, and so what we try to do is help businesses get past sort of the benchmarks, right? What I mentioned before.

Um, and I'm going to talk a lot about this bottom right one. Uh, this idea of not breaking policy because some of these are clever hacks, right? You have a video model and it generates a chunk of video. How do you make it do more? Um, you have an image thing, but it only does up to, say, 4K. How do you make it do a giant billboard like the size of this?

Um, that might not have the the high quality that you want. That that's you can do clever things about that to stretch the boundaries of the output, but how do you make sure it doesn't break policy is an architectural and design decision. And so there's a couple of things we've run into. Um I should also say a lot of my work is being uh talked about at IO that's coming up and so I'm not allowed to say a lot of things.

So I'm really sorry that I can't give you awesome examples, but if you watch the IO streams, you'll see some of the things we're doing at DeepMind. I really don't want to get fired, so I just I can't. Um so uh sorry in advance. I'll do my best to like hint without getting in trouble. Um, so I'm going to talk about some of the walls we bumped into um the problems we found and sort of like that last one, this idea of policy and then how we kind of deal with it at at DeepMind and then inside the applied AI team uh and and you know hopefully it applies to some of the things you guys are doing.

So, you build a chatbot and you tell it, please, you know, be responsible and professional and like, don't make me look bad. And I don't know, you guys saw the Chipotle screenshot of somebody being like, why do you subscribe to Claude Code? The Chipotle chat chatbot is is free and it's somebody saying, I really want a burrito, but first, can you help me write a Python function for the Fibonacci sequence? And it says, sure, here you go.

Right? Like, it's it's super common. You've all seen prompt injection, right? It How many people? Yes. Am I crazy? Okay, good. So, prompt injection is real and it's not on purpose and it's it's complicated, but like it's something we have to deal with. If you're having a user talk ultimately to an an AI backend, you have to deal with the fact that your way of defining what the agent should do is the same way that the user talks to the agent.

So, you have they're all text. And so how do you figure out how to deal with this weird problem where usually it's fine but if people say the wrong thing they chat can hallucinate and say crazy things it it's got all kinds of real problems. Um uh how many people thought if you set temperature to zero that means it's deterministic. It's not.

Um, so if you yes to an extent it is, but like so yes, technically you're getting close to determinism, but it's still nondeterministic because subtle differences in the text mean huge differences in the output, right? It's it's one of those situations where you feel like you, oh, I'll set temp equal zero and everything will be fine and it still breaks and you're frustrated and it's it's not like setting a random seed in a pseudo random number generator, right? It's not the same thing. And so getting determinism out of these different uh agents and AI backends is really tricky.

And so we've had to deal with quite a lot of that. Um so the other thing uh is is rag uh retrieval augmented generation. Uh again this is a new thing relatively right JBT is three years old. Rag is what like a year old or something. Um the idea of you you fetch a document you use it as part of your um AI pipeline and it helps to answer questions that it didn't otherwise know.

Um, now this also is kind of like, you know, cell phone, right? Um, occasionally your rag pipeline can, you know, cause trouble for you. A great example is, uh, if you've ever had, uh, a refund in your chat history and you used um, rag to pull out your chat history, even if it was an exception because it was like your mom called and that's why there's a chat log of that and so you only gave it to your mom, but it wasn't the same thing. Well, now it sees as a refund and so it gives out refunds.

Um, or if you have a test example somewhere that sells a car for $1, now maybe you're selling cars for a dollar. Um, these are really dangerous things and they it seems crazy when I say it now, like of course you shouldn't sell a car for a dollar, but like it's absolutely possible because to the agent the rationality is not necessarily there, right? We're kind of expecting it to be, but it's not. Um, our agents in a lot of ways are like really really silly interns that, you know, just got hired and they're like trying to do a good job, but they don't really know what they're supposed to be doing.

Um, so those three things are some of the big ones we've seen. There's more. Um, I'm not going to purport to be able to tell you everything about building with AI. I'm just going to kind of focus on these three. Um, but the bottom line with these three that's worth mentioning is the model is being asked to do just a little bit too much.

Um, models are amazing. I just showed like we just talked about how incredible AI is, but when you try to ask it to do crazy things like slashgo give a talk on AI like it's not necessarily going to do a great job at that like you you have to guide it more because um part of it is the model is not as amazing as we'd hope um because our expectations keep going up. Um but also it's because alignment is hard. Taking what's in my brain and what I want and turning it into words or code or images or video.

It's not a straightforward problem. It's it's actually really challenging to figure out how to get what we want out of AI because sometimes we don't know that it's not what we want until we see that it gave me something that I didn't want. And and this keeps happening all the time. And when you're dealing with customers, it happens at scale.

So this also is an interesting point here. Like the the big underlying problem is with a hackathon, everything works. It's just fine, right? But when you get to production, it doesn't. Things, you know, the edge cases are all over the place. So, what we try to do is stop using the language model as one big single router.

The whole idea is when you try and throw everything into a system prompt, um, it doesn't work, but that doesn't mean it can't solve each individual problem if you break it down. We just saw a couple of talks earlier today where, you know, they enter plan mode, they make a to-do list, they guide the to-do list by telling it, "Hey, look, if you try to call finish without having completed the to-do list, it throws an error, an actual error." These are the types of things we see. And so I'm not sure if what I'm saying is entirely new to this group. Um but I want to echo it because it is important.

So what we try to do is is surround things with determinism. Um figure out how to make things actually work by breaking down a big non-determinist pieces. So um what you can do is think of each route as individual pieces, but this transform block sort of in the middle. Do I have a pointer? I wonder if this works.

Yeah, kind of you can see it. This sort of layer of the transform block is where you start using AI. Everything else is AI but in a much smaller layer, right? You're taking random input and turning it into JSON, a structure that you know and understand. Pantic AI is amazing for this. There's agent frameworks out there that are quite good as well.

ADK, Agno, there's a lot that are all fantastic. Routing can be an LLM as well, right? Deciding what kind of action you're supposed to take. That is a decision that can be be made by a language model call. But again, that's just a route. It's deciding given this input, does the customer want a refund? Are they trying to say I did a great job or are they trying to cancel their their service?

Like whatever it might be. The routing can be decided there and then you coers it into something that makes sense. Then transforming you stick to JSON to JSON, right? If you decide that you're trying to do a task, you might say, "Okay, I want to take something that is structured and I understand it and transform it into something else that's structured and I understand it." And then lastly, you can generate output text that again is what language models are great at.

Um, and it spits out something that's human, not just JSON back to your grandma, right? It's it's something you can see. And then lastly, we can do safety checks. Um, I think uh I know Cloudflare does this and a bunch of others as well. You can use smaller uh more targeted models to just check whether something is safe or not to send back.

Um so language model picks a route and decides instead of doing the let me plan you give it a multiple choice question right that's that's the whole idea that language model is effectively acting like a classifier at that point it's deciding what is the user trying to do based on the conversation so far and shoving it into this is what I need to figure out in order to do that. So instead of letting plan mode and reasoning do it which are amazing but at production I don't think they're really ready. Um you use it uh you can course this into a multiple choice uh question. Um so like I mentioned before right this is take data turn it into something we can work with deterministically transform it again from one deterministic input to another deterministic output and then generate the actual response whether that's audio video image or text um using that structured deterministic uh um transformed output.

Um, and then lastly, this idea of of safety, I just want to harp on a little bit because no customer is going to be happy if your response says something offensive. Um, but running a language model through it still has the same prompt injection problems. So, you have a couple options. You can use a contextfree language model call.

Here's what I'm about to send to the user. Is this okay? I am a, you know, car insurance company. You know, insert whatever here. That it's pretty good at that. And there's no prompt injection option for that. And then lastly is a ML classifier. You can use a smaller, more targeted model to decide what to do.

Um what's interesting is this same pattern actually applies to um images and video. So one of the things I'm not going to talk about today is project we were working on that that deals with uh live image feed from your camera and figures out how to classify it and understand it and provide feedback and things like that. Um it's not really text, right? It's video input and then audio output, for example, like an agent.

Um we're using two different models to do that, right? There's some that are on the the actual phone that are sort of dumb models, but they're really fast. They can handle 50 frames a second. They can respond within, you know, 50 milliseconds. They can tell you, look, given this image, here's sort of the depth perception and, you know, oh, you know, this is a stool in front of you or there's an obstacle in front of you.

Compared to Gemini, which is great, and it can tell you exactly what's going on from an image, but it takes a while. You have network latency, right? It actually takes time to get time to first token is certainly longer than 50 milliseconds. Um, and so there's a difference between these two and so you have to use them in conjunction with one another.

It's not as simple as just sort of I'll throw everything at the model because the models just aren't there yet no matter how amazing they are. They're just not there yet. And so we have to do is piece things together using different tools for what's good for different jobs. And in this case we need super high latency, right?

And there's we can decompose the problem ourselves instead of having the AI just magically do it for us. So we split into sort of key frames uh and recognition using a smart big gigantic but potentially a little bit slower model. Um and then using something that's not as smart but it does have low latency and it does handle tons of frames per second. We don't have to choose a key frame.

We just send the whole stream in. Right? Problem solved. Um and so by doing this you can get the best of both worlds. Your semantic understanding as well as your real-time sort of un safety and obstacle detection for example. Um so just wanted to finish this out, right? Um LLMs are great for a lot of things.

They're like incredible like truly truly incredible. Um but we have to use things for what they're good at. So I want to use language models for all the hard stuff, right? I want to use determinism for the stuff that really matters that I can't compromise on. That non-deterministic output would be a disaster.

Um, you know, I like to joke we can't just tell our customers, don't worry, I added don't break any laws to the prompt. Like, that's not an acceptable answer. Like, that just doesn't work. Um, it's great and I wish it would. Um, but if it did, my whole team, we wouldn't exist and we'd all be fired and that'd be the end of that.

So, I'm kind of glad a little bit that it does. Um, but it's also useful to if you take this strategy and tell Claude or or Gemini Coder or uh you know um GBT codeex like just say go build this using these ideas it'll do it right. So we can still use AI for crazy things at the development stage but in real life I think we need to use the models for a little more of what they're actually good at in different places. Um now I didn't talk about a whole lot of things.

Um there's a lot more um that we think about and we work with. So um I didn't mention fine-tuning at all, right? Um how many people have done fine-tuning before? I always want to pull the audience to this. Okay, not a lot. You should try it. It's great. Um but we don't do it all the time. We do it when it makes sense.

Um and that's one of the examples of a smaller, more targeted model of doing like safety classification or stylistic approaches of how you want to structure your output. Um fine-tuning is amazing, right? It's just you have to use it in the right places. You wouldn't just try to fine-tune some gigantic model for everything if you have bad data and you don't know what you're targeting for.

Um the other thing is eval um anybody used to do TDD like where you Yeah, I sometimes tell my model to do TDD, but eval are effectively if you do them first, you're kind of doing like AI evals for TDD. Um it works, right? But it's sometimes hard to do. Um you need golden data sets, you need things like that.

So, it's I I just want to leave you with there's a lot more to do, but those three things are the ones that we bump into all the time, and there are ways to get around it by using models in the ways that they're meant to for the things that they're good at. Um, so I I mentioned before like AI models are incredible, but you have to get off the train at some point. You can't just keep riding it forever if you want to build stuff. So, I think that the key takeaway here is you can't wait for the perfect model.

I don't think it'll be here anytime soon. We have quite a long way to go. Um they're good enough now. You can build some amazing stuff and just try to determine uh make things deterministic as much as possible. So yeah, that's all. Thanks. All right, thank you so much JJ. All right, next up uh we have someone to especially to welcome to stage Jeff Huntley.

This is actually his second time uh speaking in Singapore. Uh he came last year as well. We were completely blown away by what he was sharing and decided to have him come back. Um, for those who were there at the party that was here last night, uh, he actually came on for a couple of sets and DJed as well.

Uh, so who is Jeff Huntley? He is an independent AI researcher known for doing unhinged things with AI. So he is actually the person behind the Ralph loop which is now incorporated in many, many tools that are used today. And so he's going to be giving a talk about how everything is a factory.

Everything Is a Factory: software abundance, the J-curve, and the lean apex predator

Geoff Huntley (Australia-based independent — well known for the 'everything is a factory' / coding-agent essays, ex-DevRel, prolific blog)

Summary

  • Provocative thesis (delivered as fully independent — 'I do not work for anyone'): software development now costs less than minimum wage. Photography-via-iPhone is the analogy — specialized tooling collapses into universal access, everyone becomes a 'developer'.
  • Factory / unit-economics frame: the unit economics of business have changed permanently. Cites Atlassian layoffs (a week after he gave a related talk) and Block laying off ~half its staff ('most companies will do the same' — Jack Dorsey). Argues AI is NOT yet priced into software stocks because the old growth-multiple model is breaking.
  • Software abundance thesis: society was engineered around scarcity of knowledge — that's why principal engineers, accountants, lawyers got paid. Knowledge abundance economy collapses the price of expertise across all white-collar work, not just code. The Hobbiton tour guide token-maxing is the canary.
  • Two-class company split: (a) AI-native startups that design code and processes around the new substrate, run as 'lean apex predators' (Christensen disruption frame — 'your margin is my opportunity'); (b) incumbents forced through a 3–4 year people-transformation J-curve. Picks SAP Concur (6,800 LinkedIn-listed employees in expense management) as the canonical fat-incumbent target.
  • Specific org-restructure claim sourced to a NZ founder, dated May 2023: 'We are smaller but effectively cut 2/3rds by telling our board I wouldn't backfill… 20 people, down from 60… more velocity than ever before… got rid of all the people who were detracting and sick of hearing about AI.' Treats this as a leading-indicator quote, not aspirational.
  • VC question of the moment: 'why does someone need to raise seed capital now?' Has been polling VCs / LPs across his 3-month, 17-city tour. If a 5-person AI-native team can ship what previously needed 60, the bull case for seed itself is in question. Software 'still investable but very different now.'
  • Coordination cost as the silent killer: shows complete-graph n(n-1)/2 communication-edge slides (14 people = 91 lines). The argument isn't anti-management — it's that smaller teams produce better outcomes and AI exposes coordination overhead as the real bloat.
  • Hiring filter (2026): draws a literal line on a 2024-vs-2026 split. 'I don't hire anyone left of the line.' Curiosity test, not credentials test. Candidates failing on primary keys, tool-call sequence diagrams, model-card differences, temperature — because they consume AI rather than understand it. Asserts 'an agent is just a while-True loop' and Ralph is a 'wild loop on a wild loop'.
  • Identity threat is the real friction: people who built identity as managers / leaders of people are running through the 5 stages of grief. 'AI is being pushed onto society non-consensually by Silicon Valley.' Borrowing-under-the-house termite metaphor for unseen org risk.
  • Closing prescriptions: (1) removing waste from process beats AI itself as an accelerator; (2) hire engineering managers based on what they have changed in systems and processes because AI broke them; (3) 'ideas now equal execution' — screenshot a SaaS feature, rip a fart into a coding agent, get the feature; (4) build your own agent (300 lines, free workshop on his GitHub) — 'you're not a senior engineer unless you know what a piston is'.

Slides (29)

Slide notes

  • [t=1:53:33 / 6813s] Slide 1 (TITLE / OPENING): Speaker frame + The Register article 'Ralph Wiggum loop prompts Claude to vibe-clone commercial software for $10 an hour' by Simon Sharwood — 'Developer behind it is sick with worry he might have changed software development in nasty ways.' Title bar: 'Geoff Huntley | Independent | Everything Is a Factory.' Disclaimer: 'I do not work for anyone. I am completely independent. I do not represent anyone. So this is going to get spicy.' [~6840s] Slide 2: Wall-text — 'unit economics of business have changed' (citation: ghuntley.com/real). The thesis-anchor slide. [~6900s] Slide 3: Photo from a meetup — 'Demo & interview, Money Mojo, Roslyn Zolandor, Head of Design' — Huntley's evidence that designers and PMs (not engineers) are the ones onstage talking about shipping software now. [~6960s] Slide 4: Hobbiton tour interior photo — sets up the New Zealand tour-guide anecdote. 'My tour guide operator was token-maxing.' [~7020s] Slide 5: Wall-text — 'everyone is now a software developer' (ghuntley.com/real). [~7080s] Slide 6: Timeline diagram — Huntley headshot labelled '"oh fuck" circa 2024' on the left, a Y-fork branching right. The setup for the 2024-vs-2026 split. [~7140s] Slide 7: Decorated guitar illustration — 'in the circles around me, the people who are getting the most out of AI have put in deliberate, intentional practice' (ghuntley.com/play). Musical instrument metaphor, not calculator. [~7200s] Slide 8: Same Y-fork extended — now has 'oh fuck circa 2026' (Christmas-tree icon, holiday downtime). Top branch: 'hell yeah this rocks / model first company.' Bottom branch: 'AI isn't real / corp papercuts.' The two-class company split visualized. [~7260s] Slide 9: CNN Business article — 'Block lays off nearly half its staff because of AI. Its CEO said most companies will do the same.' Jack Dorsey photo. [~7320s] Slide 10: Huntley's own Jan 2024 blog post header — 'An "oh fuck" moment in time' by Geoffrey Huntley, dated 14 Jan 2024, with Haskell platform code visible. Establishes he was 18 months early on the IDE-is-dead call. [~7380s] Slide 11: Wall-text — 'why does someone need to raise seed capital' (ghuntley.com/real). The VC-conversation pivot. [~7440s] Slide 12: 'SAP Concur' logo alone, center slide. The canonical fat-incumbent target. [~7470s] Slide 13: Wall-text — 'who have fixed overheads of 6800 people' (ghuntley.com/ngmi). The headcount claim — 6,800 LinkedIn-listed staff in expense management software. [~7500s] Slide 14: Org-chart screenshot — '1970-2025' — Margaret Chen CEO at top, full hierarchy down through EVP / SVP / Director / Sr Manager / Manager / Team Lead / Sr Engineer / Engineer 101–N. 'Every company was built with this organization chart.' [~7530s] Slide 15: Wall-text — 'how long does it take to transform 6800 employees?' (ghuntley.com/ngmi). The J-curve question. [~7560s] Slide 16: Complete-graph illustrations — communication-edge math: 3 people / 3 lines, 4 / 6, 5 / 10, 6 / 15, 7 / 21, 8 / 28, 9 / 36, 10 / 45, 11 / 55, 12 / 66, 13 / 78, 14 / 91. Coordination cost visualized as n(n-1)/2. [~7620s] Slide 17: NZ founder quote — 'We are smaller but effectively cut 2/3rds by telling our board I wouldn't backfill in May 2023.' (ghuntley.com/real). The leading-indicator quote. [~7650s] Slide 18: Zoomed-in org chart — Director NA Accounts → Sr Manager 26 → Manager 51/52 → Team Lead 51/52 → Sr Engineer 51/52 → Engineers 101-104. Still '1970-2025.' Visual of the structure being threatened. [~7680s] Slide 19: AI illustration — older bearded man in overalls slumped on an ornate bench. The identity-loss / displaced-manager visual. [~7710s] Slide 20: AI illustration — same demographic, dignified close-up portrait. 'People built their identity as a leader of people.' [~7740s] Slide 21: AI illustration — same character, hand on chin, contemplative. The grief-stages progression. [~7770s] Slide 22: Mountain-and-bridge split-panel — '2024' (left) shows 'It's not good enough. Prove to me that this isn't hype' / 'I'm experimenting with AI' / 'will I have a job in the future?'. '2026' (right) shows 'I'm using AI tools and it's helping me do my job' / 'I'm using AI to automate my job and programming managers of AI.' The two-persona split. [~7800s] Slide 23: '2024' header — wall-text 'what is a Primary Key?' (underlined). The hiring-filter slide — Huntley's curiosity test for whether a candidate is left or right of his line. [~7830s] Slide 24: Code block — '# What an agent actually is:' followed by an 8-line Python while-True loop: get_input → llm.complete → if response.wants_tool → execute_tool → llm.complete(result) → print. (ghuntley.com/agent). The 'just a wild loop' demystification. [~7860s] Slide 25: Wall-text — 'it's going to be really interesting to see how this pans out.' The pivot to closing. [~7880s] Slide 26: 4-panel comic — Panel 1 BLISSFUL IGNORANCE ('Life is good! Nothing's changed!'), Panel 2 THE UNDERMINING (termites in foundations labelled AI), Panel 3 EATING THE FOUNDATIONS (safety net foundations being consumed), Panel 4 UNSEEN RISK ('The risk they can't see.'). The borrowing-under-the-house metaphor. [~7890s] Slide 27: Wall-text — 'closing ponderoos…' The wrap signal. [~7900s] Slide 28: Speaker-cam — Huntley at podium with announcer Phyllian, end of talk. [~7906s] Slide 29: AI Engineer Singapore stage / AIE letters and Singapore skyline — handover to next speaker (Vincent, OpenClaw Foundation).

Full transcript (3168 words)

Hello everyone. Um, I'm here today as I must say, as confident as I might say and seem about these topics, this is quite a provocative title. Um, I don't know. So, when you're listening to this, I want you to reflect upon this. Maybe I'm right, maybe I'm wrong. So, it's a provocative title because it's everything is I'm saying that software development now costs less than minimum wage.

Like there was a time if you wanted to do photography, you had to buy specialized tools, etc. To do photography. But now, everyone's kind of got an iPhone and everyone's now a photographer. Think about that. Things have changed. With that disclaimer instead, I do not work for anyone. I am completely independent.

I do not represent anyone. So this is going to get spicy. Let's do it animal style. Okay. So it's been roughly about a year and a half now um since I published the technique of uh allocating memory in a particular way. And if you wrap the tool calls around another loop, it's just a loop. But there's more there's a lot of science into the context engineering to actually achieve these outcomes and it's quite disruptive and um here I was over at giving this talk uh talking about how everything has changed and uh this is a week before Alassian did their layoffs.

Oops. And uh see the unit economics of business have forever changed. I want you to really understand how much this is. If you do not believe this is true, you need to stop speaking with other developers. You need to speak with founders. You need to speak with business leaders. You need to actually get a little bit more curious on here how and what this means.

You see what does it mean when everyone is a software developer? Like here for no particular reason at all, there's like at the same meetup was cursor. I'm not maxing cursor in any way, but I want to call something out here at this meetup. Here's Roslin. And there was other people like Roslin. They're designers.

They're product managers. And they're having the time of their damn lives. There wasn't any software engineers up there giving really talks. You see, because they're being enabled to be a software developer now. For the first time ever, it's like an iPhone in their hands. They can just get stuff done. They can take photos.

They can develop software. Whatever is in their wildest dreams they can do. So, I've been traveling for the last uh 3 months around the world. I think I've given this talk 17 times now in different cities. And uh one of the cities I dropped into was Oakland. And in Oakland, I decided to do a side quest to Lord of the Rings, Hobbiton.

And my tour guide operator was like, "Jeff, what do you do?" And I'm like, "I do AI. Please don't judge me." and and next thing you know his eyes light up and he goes Jeff like how good is AI? How good is AI? What does it mean when your tool gut operator is token maxing? You see everyone is now a software developer because AI has enabled everyone to be a software developer and society has been designed around a scarcity of knowledge.

Used to charge a lot of money because knowledge was scarce. This is how we structured our societies. This has changed folks because we're now going to a knowledge abundance economy. What does it mean if you want to be a principal software engineer? You probably know things about uh deterministic system testing and property based testing and test generators and all these advanced things and formal methods and proofs.

What does it mean when that is just like wrapped up into a skill file? Um and it's not just about software engineering, it's about accounting, it's about it's about law, it's about all white collar where essentially it was based around the idea of a scarcity of knowledge. This is a transformative effect effect to society. So, if you rewind time to about two years ago, um this is me like November uh 2024.

I first said, "Oh, fuck." I published a blog post to say everything's got to change. I'll dig into this a little bit more further. And I was saying the ID was dead. And people calling me crazy for saying the ID was dead. But yeah, I mean, not many people here, at least in this room in Singapore, are using the ID dayto day.

They do some form of headless agents or async. You're probably cooking on something on your phone right now. So the models back then were already good enough to cause societal disruption, but it required a lot of skill to get the outcomes from them. A lot of skill. They're like wild horses or wild stallions.

You had to like tame them before they got good. And you probably recognize this moment in time. And this was the second This is when the models actually got good and required no skill to really tame as a harness engineer to get good outcomes to it. There's something interesting about here. No matter how good AI gets, it is in lock step to the the about the downtime that society has to be able to understand that things have got better.

So it doesn't matter if the models keep getting better and better and better. The reason there was like a oh crap moment in in December, it was like people had time off. They had Slack. They had play. They had the ability to play with this stuff and understand it actually had got better. So you're going to see product releases of like the system shock in society is my hypo.

It's is going to be in lock step with downtime in society. School holiday periods, Christmas breaks, all the rest holidays. You see, because the people around me who have been getting really good in AI in the last two and a half years, they've been treating AI not as a calculator. They've been treating it as a musical instrument.

See, musos don't just like use a guitar and go, "Oh, it's crap." And they throw it away and think it's good. They recognize it's a skill issue. They recognize it skills, bro. So, it's really important to actually just do things and be curious and learn and deliberate intentional practice. This has been the key for me is it just it's like no way this can work.

No, it's not real. It's not real. Let's do some things. Let's do some unhinged things. Let's make some discoveries. And it's through that deliberate intentional practice you get good. And it's kind of weird right now because society is like all corporates are pushing these guitars down on the world and it's like please play the guitar but not everyone's going to be musically inclined.

You see, I think there's now uh essentially two classes of companies now. Like you've got your brand new startups that are coming out right now who like the hell yeah, I'm going to do AI native workflows and I'm going to have the time of my life and I'm not going to hire a lot of people and they're leaning into workflows and really changing things around. They're not they're not thinking that they can get on AI by by selecting a particular model. They're experimenting and they're trying and they design their code bases and their processes around being able to exploit the heck out of this new substrate.

Meanwhile, you got every single company out there today um which is uh I've given this talk and there's people saying, "Oh, AI is just a tool. Uh AI is banned at my company." I'm like, "Oh god, you should quit that company." Um and uh everyone in the bottom half there is going to go through what's called a J curve. All people transformation has to go through a J curve like people transition etc. This will take three or four years to do.

You can't do it too fast because you'll break people. Meanwhile, people up the top there are going to be if you believe in the notion of disruptive innovation clay and in Christen they're going to be lean apex predators just going hell yeah your margin is my opportunity and as the models get good then they they can actually execute faster with less so you've probably seen this block lays off half it staff etc. I want you to think about this for a little bit. I think Jack is actually right with this statement, but I don't think AI is actually priced into software stocks right now.

Right? Previously, when we're pricing software stocks, it was based on a multiple on a growth multiple. We're seeing that disappear now. But I actually do think a lot of companies are going to need to rethink about their organizational structure. I want you to think about Spotify. Who here has done agile and has been forced to watch the Spotify agile video on how Spotify does agile and they got the guilds and the tribes and the squads and all that stuff.

Took two videos and everyone just started cargo culting this crap everywhere. It's going to take one Mad Lad or a couple different Mad Lads. So, we got Toby and Jack having some fun right now and they're experimenting to find out what the right thing is and they will publish a case study. And when that case study is done, it's going to be copied by everyone.

So for the last couple of months, been traveling around and I've been uh posing the following question. I've been speaking with venture capitalists and uh the question that's on top of everyone's mind is why does someone need to raise seed capital now? Like typically you'd raise money because you want to hire people to build it. N bro, just build it.

Like it's fundamentally different. Like why do you need to raise capital if it's going to be this fiveman show? Like if someone cracks the AI operating system that we've been talking about the last couple days and people experimenting and this is going to be the year we figure out whether that's true or not. Like what's the point of investment?

Come see me. I got some nuances to this but the experience of time I can't get into the particulars here. Software is still investable but it's very different now. And this is the question on every LP's mind and they're putting pressures on the GPS at VC firms. Is it still investable? So no particular reason at all.

I'm going to pick uh one enterprise company SAP. They have uh 6,800 people according to LinkedIn doing expense management software. That's a lot of people. This is representative of like a J curve people transformation program of like getting to use AI etc. How much time do they have compared to the lean apex like 50 person leveraging AI and they got 6,800 people and they're like please pick up the guitar, please pick up the tar, please get good at this stuff.

They were built with this organization chart. Every company was built with this organization chart and uh we we we basically just hired people and we had meetings and committees and all these things and the builders were very far and few between. I want you to think very carefully. How long does it take to transform those 6,800 people and how much time do the incumbents have if this is cracked?

The idea of an AI operating system and and enable these lean apex predators to get into business. More importantly, why would you transform or more? This is the quiet thing that's been set been discussed. If you don't believe me, go speak with leadership. We all know smaller teams get better outcomes. Smaller teams, better outcomes, less coordination, less overhead.

Here's a uh a quote from a founder in New Zealand. They've stopped backfilling. Companies around the world right now. They're not necessarily doing layoffs. They just stopped backfilling. We're smaller, but we effectively cut two/3 by telling our board that we would not backfill. Notice the date. That's three years ago, folks.

Like there are people who have been early. If you're thinking about these types of topics and leadership, um I'm not advocating that you should do these things, but like there are people ahead. It was the best decision because we got rid of all the people who are detracting and it was sick of hearing about AI. The sick of hearing about AI.

We're 20 people now, down from 60 and uh we're getting more velocity than ever before. And this is going to be really hard because AI is pushed down onto the world by a lot of pe by Silicon Valley. It's non-conensually onto society. And uh I want you to think about this. There are a lot of people here who have uh built their identity as uh like a leader of people or a manager of people and all the rest.

AI is erases all this stuff. Like if this problem statement gets cracked, then this is what we're literally looking at. We're looking at people with high agency and curiosity just building things. We don't know yet. I'm not advocating we do 52 pick up and throw a deck of cards in the air and do this, but this is what's on people's minds right now.

This is where we are. And this concerns me deeply because software engineers trade time and skill for money. Right? If a company's having issues with AI, that's a company issue, not your own. If you work for a company that's banned AI, you need to get out of that company. Honestly, straight now. Put your family unit first.

You see, because uh this was me back in 2024. That was I was great. The tech lead of AI over at Camber and was like, "AI is not good enough. Prove it to me. It's not hype and I start playing with it. I'm like, everything's changed. So, I saw no point other than just to completely lean into it. And then you then you now have in 2026, two years on, you got two personas.

Those who are consuming AI, whichever way, and you got people who actually understand how AI works under the hood. I want you to look very carefully. There's now a line there. I don't hire anyone left the line anymore. If you're figuring out who you should interview and how you're going to do your interviews, it's really simple, folks.

You don't hire on the left of the line anymore. It's a curiosity test. And way too many engineers are failing. And it's so sad. You see, if I was to ask you what a primary key is or to traverse the graph, you're like, "Come on, dude. Like, you're testing me." But why is it in 2026 people can't actually explain what this is?

I pull out a whiteboard, they couldn't explain what a tool call is. They couldn't actually show me a sequence diagram of inferencing. They can't get really deep. They can't talk about the differences in the model cards between the different vendors. What is the temperature? Why can't they answer this stuff?

So, if you're trying to figure out who to hire, it's quite literally people who have been curious. You should be testing for this. Sweet. Because it's really sad because LLM's and AI is just literally a wild loop and Ralph is a wild loop on a wild loop. Wow. Scary. The big boogeyman that's going to cause everything to go over.

So, it's going to be really interesting to see how this all plays out, folks. See, for a lot of people, they haven't realized that AI uh they're expecting to knock on their doorstep and to be pronounced, but really what's happening is kind of borrowing under society, under the houses. Now, closing ponderos really quickly because I'm over time. Removing waste from your organization and processes better than AI itself accelerator than AI itself.

You're trying to figure out how you hire engineering manager. The question is simple. What have you changed in your systems and processes to because AI has broken it, right? Are you doing agile anymore, not agile anymore? Well, how have you changed things? This is what you look for. You look for an engineering manager who has been thinking in this problem space.

An engineer who can build an agent, an engineering manager who's changed things around in the organization structure to achieve these things. Ideas are now uh execution. I mean like you literally can just take a screenshot of a SAS feature, rip a fart into your coding agent and you get that SAS feature. Like the old idea that ideas uh nothing execution is everything has been averted.

It's going to be really hard for people. This is actually a psychological distress function. People going through the five stages of grieving. Um but the question on everyone's mind is how long do we uh give people to get through this motions of crisis and what can we do? If you're a software engineer and you haven't built your own agent on my GitHub, there's a free workshop.

It's 300 lines of code. Build your own cursor, co-pilot, codecs, and like learn the fundamentals. Be a curious person who doesn't switch engines in a car. Be be the curious person who rebuilds an engine and knows what a piston is, what a carburetor is. Get get into the details. You're not a senior engineer unless you know these details.

Thank you. All right, thank you so much, Jeff. All right, a quick announcement um before I introduce the next speaker. Um the expo in Pullman as well as Kimpinsky uh has been open since 10:00 a.m. Um there you can find uh different things to look at like a robot playground as well as a robot display from Nabius in both places.

And you'll also be able to talk with some folks that um you heard from this morning like Arise, Google DeepMind, as well as Cloudflare. All right, to kick off this next section, um I'm sure many of you have built things like personal agents, used heard of Open Claw. So I'm really excited that this is the first speaker who's going to be opening up this section. Uh this is Vincent and he is a chief architect at the OpenClaw Foundation and he is going to be talking about the state of OpenClaw.

Amazing.

Vincent Koc (OpenClaw): OpenClaw foundations, plugins, and composable coding agent primitives

Vincent Koc, Chief Architect at the OpenClaw Foundation and co-maintainer of OpenClaw (the open-source coding-agent harness behind NanoClaw and related forks; previously ran developer evangelism / AI tooling work and presented the project's 'Dark Factory' talk at AI Engineer London).

Summary

  • State-of-OpenClaw update from the maintainer side: >1M npm downloads/week, ~50,000 commits on main, peak 800 commits/day, ~1,600 contributors, ~80,000 forks, 40 ClawCon community events across six continents — the harness has gone from bedroom project to load-bearing OSS infrastructure for the coding-agent ecosystem (and is the upstream that Day 1's NanoClaw — Gabriel Cohen's Pi-friendly fork that Vivian runs — derives from).
  • Core architectural pivot: OpenClaw is being refactored from a monolith into a plug-in architecture with a hard public/private boundary. The core (gateway, file system) stays internal; everything else — including the OpenAI provider and the Codex harness itself — is being broken out into plugins/extensions so maintainers can iterate internals without breaking every downstream user.
  • Harness-as-extension is the key primitive: you can now combine a model with a matched harness (e.g., OpenAI models default to running under the Codex harness for native tool-use and best perf). Vincent's framing: the industry is shifting toward 'models built specifically around their own harness' — deploy the model + harness as a unit, not as separable layers.
  • Recent shipped features (last 4 weeks, mostly volunteer time): 'dreaming' (UX surface that lets users see what their agents are doing in plain language, framed as a memory feature but really for end users), first-party Codex harness support, Clownfish (running harnesses inside GitHub Actions at scale) and Claw Sweeper — together took the project's PR backlog from 10,000 down to ~3,000 in two days. Vincent self-deprecates with a ~3,000-commit personal day in March ('commit maxing').
  • Maintainer-tooling stack — 'building what's missing': (1) CrawlKit / git-crawl / disc-crawl: Go-based terminal CLIs, now a library, that ingest the entirety of OpenClaw's issues and PRs into a distributed SQLite filesystem hosted on GitHub. Hourly-correct local data, cluster view in a terminal GUI (one cluster had 92 linked issues — same pain, same redundant agent PRs), and feedable to agents for auto-triage and regression detection. (2) Crabbox: distributed ephemeral compute gateway on Cloudflare + AWS/GCP that spins up Windows/Mac/Linux spot VMs with VNC + SSH so Codex sessions offload long test runs (tests that previously took 15 min and killed laptop RAM); maintainer can jump into any failing box mid-run.
  • Other internals being productized: FSafe — a TypeScript secure file-system library extracted from the core (symlink/Windows pain became the trigger). QAB — a Slack-like mocked-environment simulator that can attach real model providers, so maintainers/agents can run scripted scenarios and generate realistic conversation/interaction data for testing the harness end-to-end.
  • Tooling thesis underneath all of this mirrors how OpenClaw itself was born: 'why is no one building dev tooling that works for me at this scale when I'm getting rate-limited by everything?' The answer is to keep extruding internal pain into reusable primitives (CrawlKit, Crabbox, FSafe, QAB) and ship them as composable plugins around the OpenClaw core.
  • Comparison vs. NanoClaw (Day 1, Gabriel/Vivian): NanoClaw is the security-and-containerization-focused downstream fork chosen by Vivian for a Pi-based personal-agent stack; OpenClaw is the upstream itself — now explicitly betting on plugin-based composability and harness-aware models, not on minimal-config personal deployment. They're complementary surfaces of the same harness-as-moat thesis.
  • Closing frame: 'post-Claw era' — OpenClaw is going beyond personal AI agents to reimagine agentic tooling for 2026, with the maintainer-built primitives (CrawlKit, Crabbox, FSafe, QAB, Clownfish) given back as OSS.

Slides (9)

Slide notes

  • [t=2:11:46] Slide 1 — Title card: 'The Post-Claw Era / What's Shipped / What's Next'. Lower-third: 'Onur Solmaz & Vincent Koc — Maintainer, OpenClaw — State of OpenClaw.'
  • [t=2:12:10] Slide 2 — 'Hi I'm Vincent Koc / Your Friendly Clanker.' Photo of Vincent as a kid wearing massive early VR goggles in front of a 90s-era home setup; opens the 'technology is jagged on the edge — used 5-min warning device for 4 hours, vomited for 4 hours' anecdote.
  • [t=2:13:00] Slide 3 — Stat slide: '> 1 million npm downloads a week.' Accompanies the broader stats from the script: ~50k commits on main, 800 commits/day peak, ~1,600 contributors, ~80,000 forks, 40 ClawCons across six continents.
  • [t=2:14:00] Slide 4 — 'The Dark Factory... The Dark Side...' — callback to Vincent's AI Engineer London talk; transition into recently shipped features (dreaming UX, first-party Codex harness support, Clownfish, Claw Sweeper) and the 10,000-to-3,000 PR backlog cut.
  • [t=2:16:30] Slide 5 — 'So, what's next?' — bridge into plugin architecture: refactor of OpenAI provider and Codex harness into extensions, hard public/private boundary, model+harness deployed as a unit.
  • [t=2:18:00] Slide 6 — Terminal-GUI screenshot of CrawlKit / git-crawl in action: left pane is clustered list of issues, middle pane shows one cluster (referenced as the one with 92 linked issues), right pane shows individual issue detail. Caption frame: 'clustering and local data.'
  • [t=2:20:00] Slide 7 — 'Building What's Missing — The Tools for The Dark Factory.' Architecture diagram: OpenClaw (top, green) sits over two child boxes — 'crawlkit (gitcrawl) — clustering and local data' and 'crabbox — ephemeral compute.'
  • [t=2:21:30] Slide 8 — Screenshot of QAB UI: looks like a GitLab/Slack-style chat thread mocking agent interactions with an 'AI Operator' and 'OpenClaw Bot' running through scenario tasks; demonstrates the mock-or-real provider scenario harness.
  • [t=2:22:00] Slide 9 — Speaker-cam closing frame (Vincent walking off stage in front of the large 'AIE' letters in the AI Engineer Singapore t-shirt) as MC transitions to Vish from Ego AI.

Full transcript (1881 words)

Thank you everyone. Welcome Singapore. Lovely to be here. Uh I've presented many times in Singapore. Fun fact, I actually gave classes at N US for a few months as well. So good stuff. So as Sher said, I'm Vincent. Uh currently chief architect at openclaw foundation uh information as of today. So the foundation is definitely alive.

I'm going to talk about the postclaw era. I'm also going to talk a little bit about what we shipped and what's coming next. A little bit about me. Uh I call myself Vincent uh the friendly clanker. So if you've ever seen me present or give a talk, I use this image to describe technology in like one image.

Uh this was VR goggles. I received like many moons ago before even anyone knew what VR was. It came with a warning saying only use for 5 minutes. I used it for 4 hours and then I vomited for 4 hours. Technology is fun on the on the edge. Uh it's a little jagged but you know you learn and things change. So a little bit like open claw.

Um what's been happening? So we've had over a million npm downloads a week. We've surpassed 50,000 commits on main, 800 commits a day at its peak. Uh 1,600 contributors, amazing uh support from the community. Uh over almost close to 80,000 forks of the project. Um we've also had 40 claw cons. These are like specific like claw festival like events across six continents.

Um but the thing I want to talk about is like what we've been building and how we've been building. So in AI London I spoke a little bit about the dark factory. I think my talk is now on YouTube as well. So go check it out. But I want to talk about the dark side. So these are some of the features that we've shipped recently but I want to highlight a few of them.

So dreaming was something where we decided to really think about you know what happens when agents dream. Um but a lot of these features sometimes you might feel like like you know aimed at memory or something really cool. But this one was actually aimed at users and it's for users to really understand what is happening with their agents in a really easy to understand way. We also shipped first party support for codeex harness which I'm going to talk to in a little bit.

But one of the things we're seeing in the industry is a shift towards models built specifically around their own harness and how do we deploy the combination of the model and the harness together. So with the OpenAI specifically models, we've now switched that as a default option, which means that when you use OpenAI, it uses the codeex harness under the covers. And because of that, you get the best performance and some of the native tooling and capabilities that come with that model itself. And something I'm not all proud of, which was a little pet pet project uh named after Finding Nemo, uh having lived in Australia, uh was a clownfish.

And Clownfish was essentially running harnesses inside of GitHub actions at scale. And with Clownfish, uh, which also another project called Claw Sweeper, we were able to go from 10,000 PRs down to like 3,000 PRs in the space of two days. So, I talked about the dark side. Uh, so this is my commits. I think close to 3,000 commits back in March in one day.

Commit maxing. It's great. You should try it. Uh, but those features I spoke about, that wall of features I showed you was just what we shipped in the last four weeks with a group of volunteers and people in their spare time. So, what's next? We've been shifting towards like a plug-in architecture. The reason why we had a huge volume of PRs and issues beyond stability and bugs and fixes like that is that everyone wanted to make open core theirs.

Everyone wanted to contribute. Everyone wanted to make it that little bit nicer for themselves but that became quite challenging on a project to scale. So you could take something like openclaw the the the core itself you know you might have the gateway you have the file system but we needed some some concept around um adaptability for people and extendability. So we started building like a a plug-in sort of architecture.

Essentially code from the core started being refactored and was kind of broken out into these essentially these these buckets of of plugins. Uh we created a hard boundary which broke a lot of things for a lot of people which we've had to learn. Um, but that meant that previously the the very hard vied open claw that started off uh in a bedroom uh where all of the code was public uh the internals became private and it meant that plug-in architecture allowed for uh a clean interface. So we could continue to work in the internals of openclaw without breaking the outside experience for developers and other people in the ecosystem.

And I mentioned this also included things like for example taking the OpenAI provider converting that into an extension but also converting the harness into an extension as well or a plug-in and combining those two together. So you can now actually build harnesses into open claw and run harnesses in combination with the models themselves. The other thing that was missing that we really quickly realized at this scale was the tooling and the tooling we were using same as how open claw was born when we realized hey you know why is no one building a personal AI agent that can do stuff for me we also realized hey why is no one building dev tooling that can work for me at this scale when I'm getting rate limited by everything so we took something like openclaw and we decided to build around it so one of fun projects uh I've been also working on is uh git crawl disc crawl there's all these crawl based apps essentially that are terminal based CLI written in go and this is now a library and with that library we're able to quickly ingest the entirety of all the issues and PRs that are related to openclaw cluster them and have them in a distributed uh SQLite file system that's also stored on GitHub which means any maintainer on on a project is able to get refresh data that's hourly correct on their local file system and they don't have to connect to git. The added beauty with this is this tool is now accessible to agents that are using automatic PR work but also work that we're doing.

So pretty quickly I can kind of blow that up and see what this looks like. So this has a terminal guey on the left. These are clusters in the middle is one of those clusters. You can see one of the items had like 92 issues and PRs linked to it that were all related. And the reason for this is like nine times out of 10, most people that have a burning problem are all going to have the same burning problem and the agents are all going to send us the same PRs and issues.

The beauty with this is we can feed this to agents at rapid succession to help try and close these out and resolve them or we can see an old issue, an old regression that crops back up because a new issue comes back into that cluster again. And again, this all runs locally and is distributed across uh GitHub for any of the maintainers. Some of the other tooling also that we touched on is um something called Crabbox which was born out of um uh a lot of these sort of dev tooling that you see for like running ephemeral like Daytona E2B type boxes, but we needed something to run quickly. Every time we were running tests inside of our codeex when we're making changes, tests were taking up to 15 minutes, killing the RAM on my machine.

Uh with Crabbox essentially we built a distributed gateway that runs on on top of Cloudflare and any hosting provider such as AWS, Google Cloud and allows us to quickly use spot instances across Windows, Mac, Linux with VNC and SSH support. So what happens is my codec session when I'm coding locally will spin up 10, 15, 20 of these boxes and start testing in great succession. If there's an issue, I can jump into that machine. I can get screenshots.

I can even control it remotely myself. This meant that really quickly I no longer had to run any of the hard compute that was required on my laptop and I can continue to scale the number of agents I'm able to run uh quite quite quickly. Uh we also included things like clownfish and claw sweeper which I mentioned. Uh we started to refactor the core and build something called fsafe which is a TypeScript file system uh secure file system.

If you've ever had to deal with SIM links and Windows and all this stuff, we pretty quickly realized the library didn't exist for this. So instead of creating even more core code inside of our codebase that dealt with file systems, we decided to rip that out and actually turn that into a a sort of um library that we can use. Uh and then the last one I wanted to show you just some of the internals as well. This is another project called QAB.

And what QAB does is it mocks uh sort of like a Slacklight environment and we can run scenarios through it. Both mock and then later we added real connections to real models and real providers. So any any of the maintainers or any of the agents that are running can spin up one of these as a server on the side and run through those scenarios in a task-like sort of written fashion and generate real real sort of conversations, real interactions and real data that touches on all of the all the aspects of of the system. So just wanted to share a little bit.

I've only got 10 minutes and my time's almost up, but I wanted to show what's been up with happening inside of OpenClaw. And we're going beyond just building personal AI agents and supporting the greater ecosystem uh by sort of helping in an open source fashion, but actually sort of re reimagining what does agentic tooling look like? How do we support everyone in terms of building what the future of AI could could mean in 2026 in this sort of postclaw era and giving that back to the community as well. So, thank you very much. &gt;&gt; Thanks, Vincent.

That was fantastic. Hey, everyone. Hope you're having a good time. Up next, we have Vish from Ego Aai, which is a YC backed Neolab. Um, and they're building something that every Frontier Lab is missing.

Vishnu (Vish) Hari (Ego AI): Westworld for the internet — a foundation model for AI characters with personality, memory, and inner life

Vishnu (Vish) Hari, Founder & CEO of Ego AI (SF/Tokyo) — Singapore-born, ex-Facebook AI research, building an end-to-end foundation model for humanlike characters and an end-to-end voice model; pitch is a 'behavior model for humans' / 'virtual Westworld' instead of the IQ-roadmap chase.

Summary

  • Frames Ego AI as the deliberate anti-IQ-roadmap bet: every AGI lab is racing on reasoning and co-researcher capability; Ego is training a foundation model for behavior — AI that 'feels, talks, decides, and behaves in humanlike ways' and can have opinions about you, including not liking you.
  • Reference architecture is character-first, not assistant-first: an AI that 'lives on the internet,' can watch your screen / Netflix, project its own personality and desires onto the task, and choose whether to engage — explicitly contra the 'machine-god slave chained to be nice' framing of current companion apps.
  • Demo 1 (Calcifer, the fire character): an Unreal-Engine sidekick that screen-shares a developer's blueprint bug and co-debugs in dialogue rather than just outputting the fix — the pedagogy + bonding loop is the product, not the answer.
  • Demo 2 (Ash, Australian Claude-style agent live on stage): persistent character with location awareness ('catching waves earlier'), interruption/proity handling, and a sticky personality that survives the speaker actively roasting it — voice stack is their own end-to-end voice model (not a Whisper+TTS pipeline).
  • Memory + identity model is hinted via the personal-vinyl agent: stipend-gated character ($50/wk baseline, +$5–6 reward for a hit, –$10 punishment for a miss, death below $0) — survival pressure drives the agent to call him, mine his Spotify, and learn taste over time. Memory is economically reinforced, not just RAG-retrieved.
  • Personality dimensions implied across both demos: emotional valence (likes/dislikes the user), desire (wants stipend, wants to survive), persistent location/biography (Aussie, surfs, beach), screen-context awareness, and proactive initiation (the agent calls him, unprompted).
  • Positioning frame: 'Samantha from Her' as the north star — combine Character.AI's persona depth with OpenClaude/Operator-style utility (screen vision, tool use) so the character has both inner life and hands.
  • Distribution + hiring ask: based in SF and Tokyo, opening an SG node; recruiting foundation-model researchers ('cracked, weird, not B2B SaaS'); collecting Singaporean voice data at NTU for the voice model — paired-conversation recording sessions.

Slides (16)

Slide notes

  • [t=2:22:28] Slide 1 — Speaker walks on; behind him a livestream / OBS-style demo screen with 'Disconnect', 'Observability (1920x1080p)', and an OpenClaude tab open ('Global Craze Map: Single-Origin Co-Fermented Coffee'). Frames the talk by gesturing at the demo backdrop while audio fails.
  • [t=2:22:50] Slide 2 — Large stage backdrop text reading verbs: 'DOMINATE EXECUTE LIVE CREATE COLLABORATE BRAINSTORM' under partial words 'AI' / 'HUMAN'. Branding wall for Ego AI's pitch (humans + AI as collaborators, not master/slave).
  • [t=2:23:20] Slide 3 — Backdrop crop: 'MEET YOUR AI HUMAN' — the company tagline. Vish on stage; this is the actual marketing frame.
  • [t=2:24:10] Slide 4 — Investor / sponsor logo wall on the backdrop: Patron, dentsu, Accel, Pear, Balaji Srinivasan, Arjun Sethi (next slide also shows GFR Fund, Y Combinator, boostVC). 'JOIN WAITLIST' CTA in top-right. Establishes capital and angel backing.
  • [t=2:24:30] Slide 5 — Wider crop of same logo wall: 'Arjun Sethi GFR FUND Y Combinator Patron dentsu Accel boostV...' with 'MEET YOUR ___' partial text behind. Vish drinks coffee mid-stage.
  • [t=2:25:00] Slide 6 — Backdrop verb cloud: 'PLATFORM ACCELERATE DOMINATE EXECUTE LIVE CREATE COLLABORATE' with giant 'HUMAN' headline + lion/wolf-style logo glyph on the right edge (Ego mark).
  • [t=2:25:30] Slide 7 — Same verb cloud, tighter crop: 'ACCELERATE DOMINATE EXECUTE LIVE CREATE COLLABORATE'. Vish is mid-monologue about Samantha-from-Her as the company's north star.
  • [t=2:26:00] Slide 8 — DEMO 1 (Calcifer / 'Bug Bashing with Ash'). YouTube-style page: Unreal Engine blueprint editor in main pane, a webcam of a male developer top-right, and the fire-character ('Calcifer') as a small overlay. Right column shows other thumbnails ('IT GOT WEIRD', 'Why my first lakes...', 'The Disappearance of Pluto', 'Dachshund Live Webcam in Bad Salzungen, Thuringen', '1 Hour of Streams Constantly Interrupted by Pikmin', 'Why I left Linux Tech Tips', 'ADHD Child vs Non-ADHD Child', etc.). Lower banner: 'AI Engineer SINGAPORE — Vishnu (Vish) Hari, Founder & CEO, Ego AI — Westworld: Training a foundational model to give AI a human soul — the next frontier of AI research is simulating humans.'
  • [t=2:26:40] Slide 9 — Same demo screen, now with overlays added: 'SIMULATING' label near the webcam, 'LEVEL BLUEPRINT' label over the Unreal node-graph, and a small line-chart inset. Visualizes that Calcifer is reasoning in real time about the on-screen blueprint.
  • [t=2:27:30] Slide 10 — Transition: black screen with a popup mid-frame — a YouTube 'Up next in: X — Storchennest Live Webcam in Bad Salzungen, Thüringen — Storchennestkamera Bad Salzungen — CANCEL / PLAY NOW.' Vish has stepped off-mic to fight tech; lower banner identifier stays.
  • [t=2:28:00] Slide 11 — DEMO 2 setup slide: 'THE BEHAVIOR MODEL FOR HUMANS — JOIN WAITLIST' over a red particle-globe Ego AI landing-page hero. This is the actual product/category positioning line of the company.
  • [t=2:28:40] Slide 12 — DEMO 2 live: pixel-art red bird character (Ash / the Aussie agent) full-screen in a browser, with a small 'Data Access' (or similar) panel bottom-right. Vish is talking live to Ash on stage; the character has eyes that track / blink.
  • [t=2:29:30] Slide 13 — Ash, second pose: a pixelated horned humanoid (red-on-black demoscene aesthetic), full screen browser. Same project, alternate character skin — illustrates that the underlying foundation model decouples from the visual avatar.
  • [t=2:30:30] Slide 14 — Back to landing page: 'THE BEHAVIOR MODEL FOR HUMANS — JOIN WAITLIST' on the red particle globe. Closes the live-demo arc and re-anchors the CTA.
  • [t=2:31:20] Slide 15 — Off-stage handoff frame: Vish walking off; emcee crossing the stage. Talk ends.
  • [t=2:31:50] Slide 16 — Pure stage / venue frame: 'AI Engineer SINGAPORE' chyron + SG-skyline graphic and the 'AIE' light-up letters; emcee at podium teeing up the next speaker (Ben from Zomputer). Use as section divider only.

Full transcript (1802 words)

Thought I just had to yell out to you guys. Can you guys hear me? Hello. Okay, great. While we get set up, um, how many of you guys like actually use AI on a daily basis? Wow, that's less than I expected. Why are you at an AI conference? Um, anyway, uh, it's not a it's not a person, right? It's not like an actual human.

Imagine if like you were asking your AI tooling person whatever to do something I just told you to off cuz it's watching your Netflix. That's what we're building. I don't think this is anything any of you want because y'all are like engineers and But uh I'm building an AI that actually operates, thinks, makes decisions, behaves, talks like a human being, and even lives entirely on the internet. You can think about it like a virtual west world.

So, I'll give you a little bit of a background on me. I think we're ready. Cool to show the demo. Oh, we're not. Okay. So, the background is, uh, I grew up here in Singapore. It was incredibly boring, so I left. Um, and I moved to San Francisco. I worked in AI research at Facebook, uh, to try to understand humans because, you know, the CEO is a robot.

Um, and then eventually I decided to leave to simulate humans at scale because I really want to understand how humans work. Uh, because I'm not one myself. Um, and that's why I called the company Ego. Ego super egoid. If you know Freudian theory, you can ask your chat GPT. You probably already do do anyway.

Um, so EGO's entire purpose as a company is to do something that every single AGI lab is missing. Everyone's on the IQ roadmap, increasing intelligence, increasing the ability for AI to reason and do incredible things and be a co-resarcher. That's awesome. But what if it also had an opinion about you and didn't like you or liked you?

What if like every single companion app, which is basically a machine god slaving away, chained to always just be nice to you, wasn't nice to you, and had its own opinions, desires, and personalities, could work with you if it liked you, and is not that good at its job. It's not perfect. That's entirely our mode is that our AI feels, talks, and decides, and behaves in humanlike ways, and we're training a foundation model for that. So, let me show you what that looks like in practice.

So, uh, this is like some dude, um, who's working with this, uh, AI character. Can you hear the audio? &gt;&gt; Okay, you can't hear the audio. Anyway, uh, that kind of defeats the point, but basically that little fire guy there, Calcifer, he's an AI thing that can actually watch uh, the video that's happening up on stage, and he's bug fixing something that went wrong in Unreal. And the thing is, obviously you could have the AI just give you the answer, but that's not fun.

That's not how you learn how to fix things and you're not going to end up being bonded to this character. What it instead is doing, if you could have heard it, that would have been great, is that it's kind of working through the problem with you simultaneously. Is it working? Okay. Well, you'll just have to imagine how awesome it sounds.

Or just go to the website egoai.com and just watch the video. &gt;&gt; Play it. Okay. All right. Here we go. &gt;&gt; Hey, it's working. Maybe the AI decided to help us. &gt;&gt; It's going to work. &gt;&gt; Not again. &gt;&gt; Looks like we've got a bug to bash. &gt;&gt; Yeah. Yeah. Okay. And how do we do that again? &gt;&gt; Well, in order to bash a bug, we've got to find it first, &gt;&gt; right? &gt;&gt; You can see how it sounds not like an AI. &gt;&gt; Bingo.

Let's see if this note is even firing. &gt;&gt; Easy. All right. &gt;&gt; That's our foundation model we've trained end to end. &gt;&gt; What should it say? &gt;&gt; Doesn't matter. Let's make it something fun. &gt;&gt; Okay. How about &gt;&gt; we sped up the the audio? It's not actually that fast. &gt;&gt; Oh, okay.

Calm down, Frankenstein. Now, let's test this sucker. Okay. Okay. Here we go. &gt;&gt; Hey, that's great. &gt;&gt; What? What happened? &gt;&gt; I mean, yeah, it completely failed, but it tells us something. Back to the graph. &gt;&gt; All right. Now, let's see if we can &gt;&gt; wait. &gt;&gt; Find something. &gt;&gt; Yeah.

Oh, we forgot to replicate this pin from earlier. &gt;&gt; Nice catch. So, now all we need to do is &gt;&gt; plug it in here. &gt;&gt; Exactly. Good job. &gt;&gt; Isn't that just way more fun than everything else out there? &gt;&gt; It's going to work. &gt;&gt; That's exactly what we're building. &gt;&gt; Yes. Oh, yes.

There it is. &gt;&gt; Professor Winston's mind is going to be absolutely blown. &gt;&gt; Winston. Imagine a combination of character AI and OpenClaw where these agents have utility, the ability to like see your screen, understand what you're doing, but also project its personality, its desire, how it's lived its life on the internet onto the task at hand or just watching Netflix with you. Effectively, what we're doing is we're building Samantha from her, a thing that desires, can think, and has a sense of self. That's why we call the company Ego.

Now, let me show you a little actual live demo. Let's hope this actually works. But this is my agent. This is my claw agent. He's weird. Um, hold on. Let me see if it's going to connect. Okay, it's still connecting. So, anyway, uh, yo, can you hear me? &gt;&gt; Hey, no rush. Take your time. &gt;&gt; Okay. Can you hear me? &gt;&gt; Yeah, I can see you, by the way. &gt;&gt; Oh, you can see me?

That's weird, dude. Okay. Um, so what are you up to? Just chilling at my place. Catching some waves earlier. You &gt;&gt; dude, you're Australian, right? &gt;&gt; Yeah. Born and raised near the beach. Can't get enough of it. &gt;&gt; Oh, okay. That's great, dude. What can you do for me? &gt;&gt; So, what's been going on, mate?

You seem a bit frazzled. &gt;&gt; Oh, you're so mean. I'm in Singapore giving a speech in front of like 400 people and you're not impressing them at all. Public speaking can be tough, especially in front of people. How's the pressure going for your speech? &gt;&gt; Shut up. Okay. Um, give me some good things to do in Singapore. &gt;&gt; No stress, mate.

I'm here to help. What's on your mind about the speech? &gt;&gt; No, it's not about speech. It's about Singapore. Tell me what I should do. Can't even hear me. &gt;&gt; Singapore is a pretty cool place. Lots of good vibes. What do you want to know about it? &gt;&gt; I don't know. Just give me some like food recommendations and just shut up after that.

Mhm. &gt;&gt; Um, so we &gt;&gt; So, you're looking for some tips on what to do in Singapore, right? &gt;&gt; You know what? He's kind of a idiot. Um, but that's part of the charm. It's that these things kind of live on the internet. He actually watches my Netflix, too. It's really annoying. Um, for example, I have a claw agent, uh, that's kind of based off a foundation model that knows which city I'm going to be in, uh, and always buys me vinyls.

Uh, I collect vinyls. So, if I'm in Tokyo, for example, it'll send me really rare Tokyo jazz vinyls. And what I do is I give it like a stipend of about 50 bucks a week. And if it buys me a vinyl that I really like, um, I'll increase its stip end by another5 to $6. If it gives me a vinyl I don't like, I decrease it by up to 10 because the agent, the character knows that if it goes below $0 and its stip end, it will die.

I will kill it. So, it does everything it can to know me and understand me. And he talks to me, he calls me sometimes and he'll just say like, "Hey, like, yeah, what have you been listening to recently?" And I sometimes he'll cajol me into giving me his uh into my Spotify playlist and it's really fun. Um, so you can like figure out what I've been listening to and get me the right kind of vinyls.

Mostly these days he's been giving me anime vinyls. I am wearing an anime t-shirt, so it does kind of make sense. Um, but this is the future. The most personal AI in the world is not an AI. It's something that knows you, understands you like a person that can choose to be a friend if it wants to, and if it doesn't want to, can just exist.

That's how you create Westworld. That's how you create an AI that feels most like a person and not like a machine god slave. And that's why we're building it. We're extremely motivated to do this. We're hiring extremely cracked researchers. We have offices here. Uh well, we're based in San Francisco and Tokyo.

Um so if you've trained Foundation models, I'm literally just here to hire crazy people who want to do this wild uh and not build another B2B SAS tool. No hate on B2B SAS tools, but it's really boring. Um and we're pretty fun. So I think I've that's my 10 minutes. Uh so go check us out. Uh I need your voice.

Actually I forgot to mention that. Uh we're training an end toend voice model. So I need you to sit in a room in NTU, right? NTU and just talk to each other. I know it's really hard for Singaporeans to talk to one another. So but just do it anyway. Um cuz I need your voice to train the voice model to make it sound more like a person uh in in sort of like interruption, proity, all that stuff.

So, uh, come talk to me or Ash or or Perry or anyone honestly that you see is kind of weird is probably on our team. Um, thank you. All right. Thanks, Fish. I hope all of you enjoyed that talk as much as I did. Um, next up we have Ben from Zomputer where he's building uh tools and software for the next billion users to spin up personal agents.

Cool. Um Sorry guys. Maybe having some technical issues, but uh I'll just add lip for a little bit.

Escaping Techno-Feudalism: Personal Cloud Infrastructure, Owned Agents, and a Software Company for Everyone

Ben Guo, Co-Founder Zo Computer (ex-Stripe engineer #~80, 2015-2023; ex-early Venmo, 2013)

Summary

  • Frame: 'techno-feudalism.' Peasants (users) pay rent to SaaS (knights), which pay rent to hyperscalers (lords/clouds), which pay rent to the model + chip kings (NVIDIA, OpenAI, AWS, Oracle). Four costs land on the peasant: fragmentation, lock-in/enshittification, attention monetization, no ownership.
  • Thesis: coding agents + personal agents are the lever to burn the SaaS stack down and 'rewild' the internet. Personal agents are the load-bearing piece because they collapse the cost of building and operating your own software.
  • Personal agents landscape today is bimodal and bad: DIY (OpenClaw, Hermes) — yours, but hard to manage; TRAD (ChatGPT/Codex, Claude/Cowork, Manus) — easy, but you're back to being a peasant. Zo positions as the third way: managed + fully owned.
  • Zo Computer = cloud agent workspace = a per-user VM with root access, dedicated email address (you@zo.computer), Text/Telegram/Slack channels, model-agnostic (bring your own Codex sub or any model), built-in Stripe payments, and tools you can install/extend.
  • Origin claim: Zo is the 'original OpenClaw.' Launched beta June 2025, GA Nov 2025; OpenClaw's first commit landed 6 days after Zo's launch, and Peter Steinberger used Zo before building OpenClaw. Distribution edge: Zo supports WhatsApp where OpenClaw does not.
  • Lighthouse customer is non-technical, not a developer: Anthea, a freediving instructor, on track for $100k revenue on Zo; canceled Squarespace, Calendly, Canva, Kajabi, Mailchimp, ChatGPT, Claude. The replacement set is the pitch — one personal cloud absorbs the SaaS bundle.
  • Demo'd self-hosted replacements: a custom Calendly, 0.0.space (Last.fm clone reading directly from a Spotify-poller DB), a social-monitoring/Linear-style tracker, an in-Zo CRM/automation surface. The user becomes the system of record; SaaS becomes optional.
  • Architecture vision (final slide): Zo = personal cloud with You + Your AI on one side, Tools + Personal Data in the middle, and explicit access edges to Friends, Coworkers, Public, and Other AIs. The user's cloud is the integration surface for other agents — agent-to-agent over a user-owned data plane.
  • Historical analogy: 1951 mainframe -> 1984 Mac (personal computer); 2006 AWS -> 2026 personal cloud. The mainframe of today is the SaaS+hyperscaler stack; coding agents collapse it for the individual the way the PC did for the enterprise.
  • CTA: $100 in AI credits via QR — explicit attempt to seed non-developer adoption from the AI Engineer audience as multipliers.

Slides (25)

Slide notes

  • [t=2:31:53] Slide 01 — 'About me.' Ben Guo headshot, Zo Cofounder. Venmo 2013, Stripe 2015-2023. Four Finder-icon variants across Mac generations under 'I [heart] COMPUTER.' Lower banner: 'Everyone Gets A Software Company.'
  • [t=2:32:50] Slide 02 — Title slide: 'Zo Computer / Escaping Techno-Feudalism.' Zo unicorn mark.
  • [t=2:33:30] Slide 03 — 'This face is how AGI should feel.' Full Susan Kare Finder icon. Argument: AGI should feel like the human-machine merger of the original Mac, not a SaaS dashboard.
  • [t=2:34:20] Slide 04 — 'Who feels nostalgia for early computers & the Internet?' Screenshot collage: old GeoCities-style fan pages, Winamp skins, customized desktops. Gold sponsors strip: Stripe, Cloudflare, Vercel, AWS.
  • [t=2:35:10] Slide 05 — 'Feudalism.' Classic feudal pyramid: King -> Lords (vassals to king) -> Knights (vassals to lords) -> Peasants (serfs). Arrows show rent up, food/protection/military service down.
  • [t=2:36:00] Slide 06 — 'Feudalism' vs 'Techno-Feudalism' side-by-side. Modern stack: Kings = NVIDIA + OpenAI(?) labels (chips/models), Clouds = AWS/Oracle, SaaS = Dropbox/OpenAI, You = peasant. RENT arrows flow up the stack.
  • [t=2:36:50] Slide 07/08 — 'We are peasants on the internet.' Four bullets: Fragmentation; Services lock you in & enshittify; Algorithms monetize your attention; No ownership. Medieval peasant image beside hyperscalers-over-SaaS-over-peasants pyramid.
  • [t=2:37:40] Slide 09 — 'Burn it down.' Visual: Office Space printer-smash meme overlaid on the SaaS+hyperscaler pyramid in flames, with sprouting plants/seedlings on the right. Argument: coding agents are the lever; personal agents are the wedge.
  • [t=2:38:30] Slide 10 — 'Personal Agents Landscape.' Two columns with a '?' between them. DIY: OpenClaw, Hermes — 'Hard to manage.' TRAD: ChatGPT/Codex, Claude/Cowork, Manus — 'Less control.' Zo positions in the gap.
  • [t=2:39:30] Slide 11 — 'Zo is the original OpenClaw & much more.' Screenshot of Ben Guo's tweet thread with timeline: 'June 20 2025 - we launch our beta / Nov 19 2025 - we launch / 6 days later... first OpenClaw commit' and Peter Steinberger's Mar 12 2026 reply re: WhatsApp gap.
  • [t=2:40:30] Slide 12 — 'Anthea — Freediving Instructor / Escaped Techno-Feudalism.' Portrait + bullets: 'On track to make $100k on Zo' and 'Canceled subscriptions to: Squarespace, Calendly, Canva, Kajabi, Mailchimp, ChatGPT, Claude.'
  • [t=2:41:20] Slide 13 — 'Cloud Agent Workspace / Use any model / Bring your Codex sub.' Background screenshots of a Zo desktop with multiple tool windows open (notes, web research, files).
  • [t=2:42:00] Slide 14 — 'Email Zo: you@zo.computer' and 'Text Zo / Send it voice notes, photos, files, etc.' Channel surface for non-CLI users.
  • [t=2:42:30] Slide 15 — 'It's your computer ... in the cloud.' Screenshots of Zo system info: 63 cores, x86_64, Linux 6.12, 5.1GB/512.0GB memory, restore points, terminal with Claude Code running 'tell me about zo computer.'
  • [t=2:43:10] Slide 16 — 'Build and host anything / RIP Calendly / mine is better.' Screenshot of Ben's self-built Zo Calendly replacement (meeting types, requested slots, dark UI).
  • [t=2:43:50] Slide 17 — '0.0.space' demo. Screenshots: a recently-played Spotify list rendered on a public Zo-hosted site, plus the automation panel ('Check the currently playing track on Spotify using the Spotify app tools...') and the underlying played_tracks DB table.
  • [t=2:44:30] Slide 18 — 'Social Monitoring' / linear-style tracker. Multi-chart dashboard (12.0k, 10.0k, 9,359, 4,691, 4,227, 2,500 followers/metrics) plus a long task list — Ben's self-hosted product-ops surface.
  • [t=2:45:00] Slide 19 — 'Tons of tools built-in / Extremely extensible.' Settings pane showing connectors (Trello, Cloudflare, Notion, Slack, Spotify, etc.) and an example Ghibli-styled illustrated map output ('Ghibli sketchbook style' Capitol Komplexat).
  • [t=2:45:40] Slide 20 — 'Zo gives everyone what only tech-enabled companies used to have.' Diptych: 1951 mainframe -> 1984 Macintosh ('hello') on one axis; 2006 AWS (with bearded sysadmin) -> 2026 personal cloud (Anthea + cute cloud mascot) on the other.
  • [t=2:46:10] Slide 21 — '$100 in AI credits / Your own Personal Cloud.' QR-code CTA slide.
  • [t=2:46:40] Slide 23 — 'Zo, the Personal Cloud.' Architecture diagram: 'You' and 'Your AI' on the left feed into a cloud bubble containing 'Tools' and 'Personal Data'; outbound edges go to 'friends,' 'coworkers,' 'public,' and 'Other AIs.' Locks/permissions on the friend/coworker edges. This is the load-bearing slide for the agent-to-agent thesis.
  • [t=2:47:30] Slide 24/25 — Speaker cam: Ben exits, host walks on to introduce Matias (Taiwan AI) next. Stage shots only — no content.

Full transcript (1931 words)

I'm Ben. Ben from Zo Computer. Um, as you might be able to tell from my costume, I really love computers. I I love computers so much that I dressed up as a computer here. Um, uh, I don't know who in this room recognizes this this icon. This Yeah, right. It's it's the classic Finder icon designed by Susan K.

The Macintosh was my first computer when I was a kid. Um, I, you know, developed a love for computers as I was very young, just like using Mac Paint and then discovering kind of like web development and then like building apps and then creating stuff on my computer, like using Ableton to produce music, using like Photoshop. Anyway, I I just discovered very early that computer is like one of the most powerful creative tools ever invented by by humanity, right? It's like you can create anything that you can imagine and you can like discover anything that you can imagine too like on the internet and with all the amazing things that people have like built in the digital world.

Um yeah I guess like you know do people know about the the story of this icon and and what it kind of represents? Um raise your hand if you know like what it what it means. Um no. Okay cool. Well I'm just going to use the shirt as my slides for now. Um, so, um, the shirt it represents like the the union between the human, which is like the gray face here, and the computer, which is the blue face, and you they're like in in perfect happy harmony, like the human is interacting and kind of merged with the machine.

Nice. Thank you. Um, so the the title of my talk is escaping technofudalism. I introduced myself a bit, but just some bit more backstory on me. I'm the co-founder of Zomputer and I've been building stuff for a while. I started my career on the early Venmo team in 2013. Um, and then I joined Stripe quite early.

I was one of the first like 80 or so engineers in 2015. Um, and I worked there for eight and a half years. I just loved it. It was a really great place to work. Um, shout out to Stripe Singapore, which is now a huge office. They have like 500 people. I visited my alma mater the other day. Um, and I talked about how I really love computers.

And you know, computers, they they used to feel like this, like this face. And this is how I think AGI should feel when it comes. Like it should feel like this beautiful, happy merging between the human and the machine, the human using the computer as this tool. That's how I want AGI to feel. Who like me feels some nostalgia for like early computers and the internet, right?

Like raise your hand if these like images they like bring up some like fond memories of of how things used to be, right? The internet used to be so like handmade and personal and wild, a little bit janky. And like our computers, they were like so creative and like personal. We could like customize them in all these crazy ways.

If you like made Winapp skins, raise your hand. I spent so much time like customizing my WinApp. Um, and things changed. Things don't feel that way anymore. And the reason why that happened is because of feudalism. So um feudalism is is this system that that was the way the world worked for a long time um in the west and in the east.

Basically the peasants they paid rent to the knights who paid rent to the lords who paid rent to the king. And it was great for the king and really really shitty for the peasants. And luckily we have escaped feudalism. Or so we think. But in our digital lives, feudalism is still alive and well. We are still peasants.

And we use SAS companies and pay them rent. And the SAS companies, they pay rent to the clouds who pay rent to the kings. And it still sucks to be a peasant. Now things are a little bit complicated. It's a little bit unclear right now with AI like who the new kings are going to be. Everybody's like, you know, paying rent in all these like weird ways to each other.

So, it's not exactly feudalism. It's like a little bit more complex. But basically, it's feudalism. And the result is that our experience of computers and software and the internet is quite shitty as peasants. Like we are fragmented between all these different services which lock us in. They take our data and they sell it back to us.

And that PM at that SAS company that you use is never going to prioritize that feature that you want. They're never going to make the software work just the way that you want it. Instead, they're going to continue monetizing your data and your attention. And because you're a peasant, you don't own anything.

And I think it's time to burn it all down. Like obviously some SAS is useful. Infrastructure is important etc. But because of coding agents we have this like great new tool to rebuild and rew wild the internet and I think personal agents in particular are a really important piece of how it will make this happen.

So the landscape of personal agents is basically like this. I'm not going to go into it too much because you probably understand how it works, but basically there are these like DIY things like OpenClaw or Hermes that are like kind of difficult to set up and operate. Um but they're yours. You control them and you might have like set it up on a Mac Mini or something and like fix it if it breaks.

Might be kind of annoying. That's one path. The other path is the TR approach where you use something like ChatBT or Manis. Um but there you're you're a peasant again. You're using a SAS tool that is going to lock you in and is not incentivized to give you control. So at Zoumputer, we believe that there should be a third way.

Something that's the best of both worlds. It's easy to manage and it gives you full control and it can be your real home on the internet. You can stop being a peasant and own land. So Zo is actually the original Open Claw. We got started last year in the summer. We launched uh around like July and then we did our full GA launch in November.

And actually Peter Syberger used Zo before he started working on OpenClaw and we were kind of the inspiration behind OpenClaw. I think um Zo is working for non- technical people. This is Anthia, a free diving instructor. She's on track to make $100,000 on Zo. We have like built-in payments with Stripe. And she's canceled all of these SAS subscriptions that they she used to use.

Like she used to use Squarespace and Kalani and Chashbt and Notion. And she's replaced all of that with her Zo. And I'm going to show you what that looks like. So, Zo is this very powerful cloud agent workspace. You can use any model. You don't have to be locked into like OpenAI or Enthropic. You can even bring your codec subscription.

You can just text Zo or you can email it. We give you a dedicated email address. You can use Telegram or Slack. All these different channels to work with your Zo. And it's a computer, so we give you a full really well setup VM. It's a lot easier to use and has a lot more bells and whistles than if you just like took a bare metal like VPS or like an EC2 instance.

And you get root root access to it. You can like use the terminal, you can install stuff, you can do whatever you want with it. It's your server and you can really build anything and host it inside of your Zo, which is quite different from like these personal Asian tools or these SAS tools. I have hosted a lot of different tools inside of my Zo.

For example, I replaced Kalendly with my own thing, which works much better. It has all these features that I like um that Calendarly is never going to build for me. This is my replacement for Last FM. I have a personal website, 0.0.space, where you can see everything I've listened to on Spotify. I have a very simple automation running in my Zo that just checks why I'm playing in Spotify, and it writes it to a database, and my site just reads directly from that database.

I've built tons of tools. This is like social blade. This is my like kind of linear replacement. You can just replace stuff and make it work the way that you want. And the data is yours and you are the system of record, the source of truth, which is just really nice. It really changes like the way that the arrows point.

I am the center, not these SAS companies. And Zo comes with all these tools built in and it's extremely extensible. So you can get started really quickly and you can really expand it to be just the way that you like your real home on the internet. Your land on the internet. And let's see. Oh no, my clicker.

Oh yeah, cool. Well, um I just want to pause here. Uh scan this QR code. Uh it's in the corner. Hopefully you can see it. Um but we're giving away $100 in AI credits to give you Zo uh and to get started building your own personal cloud. So take a moment to scan this and then I have one more slide just to talk about kind of what this means like the bigger picture.

The bigger picture is really that we are giving everybody what previously only tech- enabled companies had. So this is what happened with computing generally like in the beginning computers were mainframes only large tech enabled enterprises had them and then eventually they became something that everybody had. The same thing is happening now like the mainframe of today is like cloud computing software and infrastructure and with coding agents and personal agents and access to the cloud we can give everybody like Anthia this free diving instructor access to the same tools that software companies had. And this is the revolution that is happening now and will be happening in the future.

And this is how the internet is going to become fun and wild and free again. We're going to have our own personal clouds to store our data, to build our tools, and to create these surfaces like websites and APIs and agents for other people to interact with. And this, I think, is the future of the internet. Thank you.

I'm Ben from Zo Computer. Thank you so much, Ben. All right, everyone. Up next, we have a talk I'm very excited about. As many of you know, a big part of the magic of Open Claw is the PI coding agent running under the hood. Um, so we have Matias here to talk from Taiwan AI to talk about uh how to incorporate PI into your product.

A Piece of PI: embedding the OpenClaw coding agent in your product

Matthias Lubken, Founder of Tavon AI (Vienna-based AI agent shop, ships DataBox for safe agent data access; ex-developer/PM/manager, self-styled tinkerer)

Summary

  • Core thesis: coding agents will be a core building block of all software going forward. They are not magic, just 'tools in a loop with a bash and a runtime.' Stop treating them as scary internal-only experiments and embed them in your product.
  • PI here is a proper noun, not prompt injection and not a principle: it is Mario Zechner's minimal coding agent (pi.dev, Vienna), a deliberately bare-bones competitor to OpenClaw and Codex. Matthias also flags Zerostack on HN (a Rust rewrite inspired by PI) as evidence the minimal-agent pattern is becoming a category.
  • PI's value prop is what it is NOT: no built-in MCP, no sub-agents, no permission popups, no plan mode, no built-in todos, no background bash. You add what you need via PI extensions, written by PI itself. Demo: 'Create a PI extension that asks for permission when I want to push the main branch to remote' — PI writes the extension, you reload, the guardrail is live.
  • Conceptual ladder Matthias walks the audience up: (1) Chat = LLM + instructions; (2) Tools = LLM extensions like calendar/web-search returning JSON; (3) Agent = tools-in-a-loop driven by an agents.md / CLAUDE.md; (4) Coding agent = agent where the primary tools are bash + read + ls + write + a runtime. Same loop, just sharper teeth.
  • OpenClaw-style architecture diagram (sourced from a medium.com/@ottio2tech post): User Interaction (WhatsApp/Telegram/Discord/SMS) → Gateway (router/session) → Agent (brain) + Persistent Memory → Skills (modular capabilities) → Local Machine Resources (filesystem, terminal, browser, apps). The 'brain' slot is where you drop PI (or any coding agent).
  • Live customer build at Tavon: same OpenClaw blueprint but with email as the gateway, containerized clients per tenant, and domain tools instead of whisper — CRM + ERP for a B2B sales agent. Screenshot shows tool calls hitting an ERP to check part availability; the magic is just routed tool calls.
  • Peter-the-OpenClaw worked example: user sends voice note → file tool grabs wav → whisper tool transcribes (when it fails, agent falls back to an API translation tool) → agent replies in text. Magic = a small, legible chain of tool calls in a loop.
  • Tinker, do not theorize. PI is small enough to read end-to-end, you can throw the slide deck at PI and have it scaffold the agent from this talk. Take-aways slide: (1) coding agents = core building block, (2) agents are not magic, (3) PI is perfect for tinkering, (4) go tinker.

Slides (17)

Slide notes

  • [t=2:42:50] Slide 1 — Title: 'A Piece of PI: Embedding The OpenClaw Coding Agent In Your Product.' Matthias Lubken, Founder, Tavon AI. Map of SE Asia (China, India, Thailand, Vietnam, Philippines, Malaysia, Singapore, Indonesia, Australia all checked) — riffs on how many Southeast Asians he met.
  • [t=2:43:40] Slide 3 — Career arc montage: Developer → Product Guy → Manager → Tinkerer. Self-positions as 'AI engineer / tinkerer.'
  • [t=2:44:10] Slide 4 — TAVON.ai: 'Put AI Agents to Work' (Strategy / Training / Implementation). Product: 'DataBox — Give agents safe data access.' QR code on slide.
  • [t=2:44:50] Slide 5 — Hacker News top of page: 'Zerostack: A Unix-inspired coding agent written in pure Rust.' Crates.io page shown: zerostack v1.0.0, 'Minimal coding agent written in Rust, inspired by approach.' Features list: multi-provider (OpenRouter, OpenAI, Anthropic, Azure, Ollama), file tools (read, write, edit), fuzzy match with diff display, grep, find, files, ls.
  • [t=2:45:25] Slide 6 — PI terminal screenshot. Context: CLAUDE.md. Skills: agent-browser, frontend-design. Extensions: ping-an-aone.ts. Footer: pi.dev | mariozechner.at.
  • [t=2:46:00] Slide 7 — 'What Is Not Part of Pi?' (no MCP, no sub-agents, no permission popups, no plan mode, no built-in todos, no background bash — all delivered verbally).
  • [t=2:46:30] Slide 8 — Terminal demo: 'Create a PI extension that asks for permission when I want to push the main branch to remote.' Shows PI generating, then loading, the extension.
  • [t=2:48:10] Slide 9 — KEY ARCHITECTURE DIAGRAM. 'OpenClaw Local System (Self-Hosted).' User Interaction (WhatsApp, Telegram, Discord, SMS, Messages) → Gateway (Message Router & Session Manager) → Agent (AI Brain / LLM) ↔ External LLM API (Claude, GPT), Persistent Memory (Context, Preferences, History, Local Files/DB), Skills (Shell Commands, File Management, Browser Automation, API Integrations) → Local Machine Resources (Local Filesystem, System Terminal, Web Browser, Local Apps). Source URL printed: medium.com/@ottio2tech_28094/inside-openclaw-how-it-works-ce1c1fd7aed1
  • [t=2:50:00] Slide 12 — Section divider: 'Coding Agents.'
  • [t=2:50:30] Slide 10 — Code: tool definition (get_strawberry_price) and agent instantiation using @arendil-works/pi-agent-core. systemPrompt: 'You are a strawberry B2B sales agent. Use tools whenever the user asks for pricing or a quote.'
  • [t=2:51:00] Slide 11 — Same tool/agent code on left, on right: this.agent.subscribe(event) handler for message_update events writing assistantMessageEvent.delta to stdout, then async run() calling this.agent.prompt twice ('What's our strawberry price per crate?' / 'Create a quote for 10 crates.').
  • [t=2:52:00] Slide 13 — Code: createAgentSession from @arendil-works/pi-coding-agent. model: getModel('google', 'qwen-3'), tools: ['bash','read','ls'], sessionManager: SessionManager.InMemory(cwd). The minimal coding-agent template.
  • [t=2:52:45] Slide 14 — Same plus right pane: session.subscribe handler + session.prompt with system prompt 'You are a coding agent. Use bash for shell work. First inspect the workspace, then use bash to count. Finally answer with the line count.'
  • [t=2:53:30] Slide 15 — Photo of 'Peter' (the OpenClaw user). Worked example: Peter sends a voice message → agent invokes file tool (wav), whisper tool (decompose), falls back to API translation when whisper returns nothing → text reply.
  • [t=2:55:00] (No slide change) Tavon customer example: email-gateway architecture, containerized clients, CRM + ERP tools, German UI screenshot showing tool calls into ERP for part availability.
  • [t=2:55:50] Slide 16 — 'Take Aways': (1) Coding Agents Will Be a Core Building Block. (2) Agents Are Not Magic. (3) PI Is Perfect for Tinkering. (4) Go Tinker. matthias@tavon.ai | @luebken | tavon.ai. QR code.
  • [t=2:56:25] Slide 17 — Empty stage, AIE letters lit. End of talk.

Full transcript (2012 words)

All right, everyone. Uh, thanks a lot for having me. I guess I need the slides. Okay, perfect. Hello everyone. Thanks a lot for having me. Um, yeah, today I'm going to talk a little bit about the piece of pie embedding the open claw coding agent in your product. And yeah, um, I've done re I've re redone the slides a couple of time and this is the reason.

Um yesterday I was walking around and I was amazed of how many people I've met uh from Southeast Asia. This is my first time in Singapore and it's amazing where I met people from all all over South Asia and these were some of the questions that I got possibly maybe not. Here we are. Um um oh we I love open claw.

Love these agents but I'm using them only internally or yeah I love agents but I want to control my agent. It's it's it's doing too much magic. I feel open claw is scary. So first the first message and if you take one thing away um we're all getting started here right um we we are just just getting into uh into this stage and so let's learn right let's learn together was saying uh let's be curious I would say let's tinker let's play around with this and let's do this together so my name is Matias I'm I have this strange journey of being a developer, then product person, then manager, and now I'm back to developer, AI engineer.

What does that even mean? I don't know. I'm calling myself a tinker right now. So, I'm playing around with these things. So, I'm I founded my own company. Uh we put AI agents to work. Uh we have this um making uh the agents safer access to their data uh called data box. So, please check it out. But today, I'm going to talk about PI.

So what is pi? But before I talk about pi, I want to uh do a disclaimer. This is not only about pi. If you open up uh hacker news right now, uh you'll see in the top of the page zero stack. I have no idea what zero stack is. Uh I opened it and it's a minimal coding agent written in Rust inspired by pi. Right?

So um uh this talk is going to be about pi and I think it's a good learning exercise but in no means is an advertisement that's the the end of it all right you should play around with these tools and and uh get your hands dirty. So pi is this coding agent you see pretty familiar uh of what what it does uh similar to codeex or openclaw. It's by this nice fellow Mario uh built uh out of Vienna. And the nice the interesting part is when you get started and what people sh off what it's not PI hasn't doesn't have any MCP.

It doesn't have sub agents. It doesn't have permission pop-ups. It doesn't have plan mode. It doesn't have built-in to-dos. It doesn't have background bash. So you're saying, "Okay, so what's the big deal? Like why should I use it?" Well, the point is with Pi, you tell it to do it. So, um this is an example I've done yesterday.

Uh please create a PI extension that asks for permission when I want to push to main uh to the main branch to the main branch to remote. And this is like you know it reads a couple of things on how to do this. It confirms of what it has done right. So, it has created this PI extension. It has loaded the PI extension.

Well, actually you have to reload, but basically it's there. And then when you do it, you get this permission, right? So I was like like, hey, there there's a command above like push this to to remote. And there was this question now uh is now being asked, okay, do you really want to do this? So the point being is pi is this really minimal coding agent and you can fool around, play around with and write the extensions that you need.

All right, so let's take a step back and think about like how does this relate to open claw. Um there's different diagrams on how you how you can visualize open claw but basically I think there's a couple of things that are important. We get somehow the message in there whether it's with uh open uh WhatsApp telegram discord there's some gateway and on the right hand side there's lots of tools and what data it has access to. It has this memory and obviously can talk to the to the external but I think the important part is the internal brain and that's pi.

So let's look at it. So I've been talking about a coding agent and uh coding agent not as only as for the developer but also as this component within the system. So what is it? What is a coding agent? And before we uh talk about codic agents, we need to talk about chat. So very simple, right? You know all this this is chat GPT.

You ask it a question. You give some general instructions maybe up front. What's the AI best AI conference? Obviously it's AI engineer. Where are the coolest developers? Obviously in Singapore. Now the next part which we need to understand and you know for those who don't know just briefly um are tools and tools are the means of an LLM to extend its capability in a sense.

So here's an example. I have a meeting uh with a buyer tomorrow. Please help me prepare it. And instead of of well obviously the LLM or the loop or the thing the agent needs to have access so it calls this calendar right and calendar in this case is a tool. The other prominent example is web search right if you do web search that's often um an external tool or other other means uh which we're going to see in a second.

But anyhow, so in this in this case, what you do, you ask um uh to prepare a meeting. It checks the calendar. It returns some JSON and you get uh the the the result your meeting is tomorrow, right? So again, what is a coding agent? And before that, we're going to talk about agents itself. So agents itself is actually running this these tools that we've just seen in a loop, right?

Uh Jo showed this earlier uh this very simple loop right and you do loops and out loops but again a very simple loop right so you ask this uh uh again give some instructions some general instructions and if you do this within with an agents you have this these common files called agents MD or cloud MD and then you ask a questions and it does this call uh tool call it gives some result it does this again and again and again until the final result, right? That's generally an agent. Um, and if you do this, you can do this with pi. Um, here are some examples.

And, uh, by the way, I'm going to share the slides or the actually the slides are already online. So, you can grab them there. But here, that's it, right? You define the tool um on the left hand side. Then you define the agent, right? And this is pi but in other areas you would you would have this similarly right.

So you have the general prompt the instructions you uh define some model you define the tools on the upper hand right hand side we basically tell the agents to talk back to us. So whenever there is a a message please put it out write it to stand it out and then you you query it right that's all and with other tools it's similar. So please give it a try. So again now we we have a know we know basically what an agent is what tools are.

So what are coding agents and coding agents are actually just agents. So tools in a loop with a bash and a runtime. So uh instead of these generic cool tools we're calling right we are now calling the bash right. So we have a tool call we have some return we have a tool call and uh and return. All right.

So, um again very briefly, this is how you set it up. You see these tool calls. There's bash, read, and ls here in manager. Um uh which we're not going to talk about details here. But this is basically the the core setup. And if you use pi to program this, right, it's like you probably can throw throw the slides to pi and say please replicate of what Matias talked about, you can very easily create this.

Okay, let's make this concrete. This is Peter. This is his open claw. And at one point he uh sended him a message right now a voice message. And the agent start thinking and it responded with a text and the question how did this work. So again we have the user uh sending uh uh doing some basic instructions soulm etc.

You have different tools read, write, bash, and then these tools are the actual magic that happened, right? So, we have a file uh that examined the voice message and it turned a wave file. You have whisper to decompose the message. Now, uh in his example, um whisper didn't return anything. So, in that instead, it did a um an API call to actually translate the voice message to text message.

Right? So right the core of what we see as magic in these agents right are tool calls uh in a loop with different setups and that's please uh give it a try it's not not that hard. All right to uh to finalize it um here's another example u because the talk is about like embedding this into other products. Um this is a project that we've built.

Um so we've we've got inspired by the uh open claw architecture. So uh but instead we're using email as the input. We have a general gateway and then we have different containers uh for uh for running uh the different clients and then we have these different tools and now these tools are not uh whisper or anything but there are like the CRM the ERP and dedicated to the specific use case. And here's here are some screenshots right so um here on the right hand side you see the general user message you see the inbox uh the recent activities and how it responds but interesting for the for the engineering part is the left hand because here apologies it's German uh but here on the left hand side we actually see the different tool calls and you see on how the ERP uh system is triggered whether parts are available or not right so um yeah with that said um coding agents I strongly believe in some fashion or the other will be part of software in the future right so please look at them now a these agents these coding agents are not magic so please you know uh uh you know fool around with it pi is perfect for tinkering so it's a good way to to learn about this and finally please go tinker thank Thank you so much, Matias.

All right, everyone. Up next, we're going to have a bit of a change of pace. Our next talk is going to be from the design track, and we're going to have Josh from Microsoft, who will be talking to you about how to design products that help users be more creative and thoughtful instead of being an infinite slop

Design Is the Difference: Why Creativity, Not Automation, Is the Competitive Edge in the Age of AI

Josh Newton, Principal Product Designer (Health), Microsoft AI; founder of Flooben app studio (London); shipped Orbit subscription tracker (3x Apple-featured, six-figure ARR in year one).

Summary

  • Frame: 'AI is a magic pencil.' The model is a tool, not a co-author. Trained on everything that already exists, it returns the weighted average of the most common patterns — speed of execution is collapsing quality to 'good enough,' and the only gap that still matters is the gap between generated and crafted.
  • Three-chapter structure: (1) The Pencil — challenge how you use AI today; (2) The Poster — increase creativity and augment it with AI; (3) The Artist — convince every role (writer, researcher, designer, engineer, PM, salesperson, scientist, marketer, founder) to ship like an artist.
  • Design-as-edge thesis: 'AI has raised the floor, but it hasn't raised the ceiling.' Floor-raising is commoditized; ceiling-raising — deciding what to build, why, for whom, then obsessing over detail — is the durable moat. Tools change; the demand for insanely great well-made things does not.
  • Proof artifact: Orbit, a personal-subscription tracker in a saturated category, hit six-figure revenue and three Apple features in year one. Differentiator was not features but craft — the entire UI was lifted from a Matisse-style mid-century paper-cutout poster on his apartment wall, deliberately not from other apps.
  • 'Creative thinking tool calls': walking with no headphones, staring out the window, deliberate boredom. Frames non-work as the real input to creativity — AI accelerates execution but the novel idea has to arrive from outside the training distribution.
  • Personal tooling pattern: build bespoke prototyping tools per project. Examples shown — a custom shader tool for the intro slide; debug panels with sliders/toggles (data-richness control, feature flags) to jump between empty-state and weeks-later product states; 'Flubot' (his own OpenClaw) for voice-dictation -> book research repo and rapid iOS prototypes pushed to Xcode as PRs.
  • Iterate-and-subtract discipline: shows the Orbit app icon iterated through ~30 variants over an hour in a cafe, and an early Orbit subscription-detail wireframe full of clutter being reduced to the minimal shipped version. Two failure modes named: (a) never iterating past v1, (b) compulsive feature bloat.
  • Closing argument (quoting his Microsoft AI colleague Amir Kiani, 'We are all chefs now'): software has flipped from how-to-build to what-to-build — you can spend months with a big team building the wrong thing and no amount of AGI will save you. The cure is craft, taste, and treating yourself as an artist.
  • Recurring deck refrain: 'Coded by AI. Designed by humans.' — and a timeline (Illustrator 1987 → Photoshop 1990 → Figma 2016 → ChatGPT 2022 → 'Claude Design' 2026) framing AI design tools as the next inevitable rung, not a replacement for taste.

Slides (28)

Slide notes

  • [t=2:56:27] Slide 01 — Title card: 'Josh Newton / Principal Designer, Health, Microsoft AI / Design is the difference: Why creativity, not automation, is the competitive edge in the age of AI.' Three logo lockups: Microsoft AI (MAI), Flooben.com, Orbit.
  • [t=2:57:00] Slide 02 — 'Chapter 1: The Pencil.' Three icons + chapter signposts: pencil ('Challenge you on how you're using AI today'), framed poster ('Increase your creativity, and augment it with AI'), figure ('Convince you that you're an artist').
  • [t=2:57:40] Slide 03 — Full-bleed James Dyson portrait. Quote: 'Difference for the sake of it. In everything. Because it must be better.'
  • [t=2:58:20] Slide 04 — 'The past year has seen an explosion in coding productivity.' Four sparkline charts (New websites, New iOS apps, GitHub code US, GitHub code UK), all spiking +20-40% YoY into 2026. Sources: Domain Name Industry Brief, SensorTower, GitHub quarterly pushes, FT graphic / John Burn-Murdoch.
  • [t=2:59:10] Slide 05 — 'AI should augment our creative abilities. Not replace them.' Beside: Simpsons gag — Homer at an old PC.
  • [t=3:00:00] Slide 06 — Orbit case-study slide. Three App Store / iOS 26 screenshots ('See what's new in iOS 26 / Great Apps Updated for iOS 26', a Sky Guide / Astronomy Finder editorial, and a 'Today / 26 apps for 2026' editorial). Caption: 'Orbit / 3x Featured by Apple, 2025.'
  • [t=3:01:00] Slide 07 — Refrain slide: 'Coded by AI. / Designed by humans.' Black background with a faint sphere gradient.
  • [t=3:01:40] Slide 08 — Design-tools timeline collapsing into a single vanishing point labeled 'Something insanely great.' Entries: 1987 Illustrator / 1990 Photoshop / 1996 Flash / 1998 Fireworks / 1999 InDesign / 2010 Sketch / 2013 Canva / 2013 Webflow / 2014 Framer / 2016 Figma / 2017 Protopie / 2022 ChatGPT / 2026 Claude Design. (Note: 'Claude Design 2026' is on-slide — track whether this is real or speculative.)
  • [t=3:02:30] Slide 09 — Chapter break: 'Chapter 2: The Poster.' Picture frame + slime/blob mascot.
  • [t=3:03:00] Slide 10 — Dan Koe quote: 'Your best work is done when you're not working. When you have space for creative ideas to emerge.' Photo: Matisse paper-cutout-style 'Papiers Découpés' framed poster on his apartment wall.
  • [t=3:03:40] Slide 11 — Close-up of the same poster annotated with marker — 'GALERIE D'EXPOSITION' / 'Paris, France' / a hand-drawn frame outline; Josh's read of the poster's brutal simplicity (count visual elements on one hand) that became Orbit's design grammar.
  • [t=3:04:20] Slide 12 — Slide title: 'Creative thinking tool calls.' Black square placeholder for the b-roll (walking with no headphones, staring out a window). Reframes non-work as the input pipeline.
  • [t=3:05:00] Slide 13 — 'Follow your curiosity.' Single magenta squiggly line filling the slide as a visual metaphor.
  • [t=3:05:40] Slide 14 — Transition / blank dark slide (build prototyping tools intro).
  • [t=3:06:00] Slide 15 — Screenshot of Josh's bespoke shader tool used to produce the talk's intro slide (Model Studio-style panel with gradient preview, sliders, export).
  • [t=3:06:40] Slide 16 — Mock debug panel for a 'Run / week 1' training app: prototype mobile/web toggle, screen selector (today/week plan/run log), training stage (week 1), data-richness control (first run), tone (coach). Pattern: feature-flag every state so you can jump between day-one empty UI and weeks-later populated UI on demand.
  • [t=3:07:20] Slide 17 — 'Flubot' / his personal OpenClaw. Left: iOS Messages thread with Flubot — voice-dictation while walking, dumping ideas for the creativity book. Right: black placeholder.
  • [t=3:08:00] Slide 18 — Flubot result: agent organizes book research, opens a PR on a 'flubben-ideas' repo, and on the right a working iOS prototype showing 'CREATIVE BATTERY 56%' with Office / 0mm / Sleep / Light day metrics — 'tracking real creative battery as a percentage' from a screen-time-API prompt.
  • [t=3:08:50] Slide 19 — Mostly black transition slide (between the Flubot demo and the talk-prep doc).
  • [t=3:09:20] Slide 20 — 'Talk Prep Doc' for this very session: 'Singapore Talk — Master Document / Design is the Difference / Why creativity is your competitive edge in the age of AI / AI Engineer Singapore — May 15-17 2026 — Josh Newton.' Notion-style outline with The Core Premise, The Core Talk Themes, Chapter 1 The Poster, etc. Josh built a navigation/spatial-view over the doc using Claude as a 'stumble across information in fun ways' tool.
  • [t=3:10:00] Slide 21 — 'Lateral inspiration + Personally augmenting tools = Unlimited creativity.' Magenta squiggle motif on the right.
  • [t=3:10:40] Slide 22 — Amir Salihefendić (Doist) quote: 'The best products are made by people who put a piece of themselves into the work. The worst products feel soulless. AI has made it super easy to create soulless things at scale. But it doesn't have to be this way.'
  • [t=3:11:20] Slide 23 — 'Iterate. Iterate. Iterate. Iterate. Iterate. Iterate. Iterate. Iterate.' (eight times) beside a 6x5 grid of Orbit app-icon variants generated and curated over one hour in a London cafe.
  • [t=3:12:00] Slide 24 — 'Subtract.' Diptych of Orbit's subscription-detail page: left, an early wireframe full of insights, charts, price history (with scribbled-over deletions); right, the shipped version — Netflix £10.99, billing/next payment/total spent/subscribed/category, mark-as-cancelled CTA.
  • [t=3:12:40] Slide 25 — Amir Kiani (MTS, Microsoft AI) quote: 'It's now about the craft. For years, software engineering was mostly about learning frameworks and writing code. Most of our time went into how to build, not what to build. That has flipped. You can now spend months with the big team building the wrong thing and no amount of AGI will save you.' Source: 'We are all chefs now.'
  • [t=3:13:20] Slide 26 — Transition to the final argument: black placeholder on the left, single word on the right — 'Artists.'
  • [t=3:13:40] Slide 27 — Role list mapped to a single endpoint: 'Writers / Researchers / Designers / Engineers / PMs / Salespeople / Scientists / Marketers / Founders → Artists.'
  • [t=3:14:00] Slide 28 — Stage shot, talk ends; host walks on. AI Engineer Singapore signage / skyline backdrop. No content.

Full transcript (1686 words)

Hello. Hello. There we go. Hey everyone. My name is Josh and today I'm super excited to uh talk about why I believe design is the difference. We will explore together why I believe creativity not automation is the key competitive edge in the age of AI. I'm currently a principal product designer in the health team at Microsoft AI.

I'm also the founder of Flubin, an app studio in London that launched its first product last year, Orbit, helping people to save money by tracking personal subscriptions. This talk is going to be made up of three chapters. I'm going to challenge you on how you're using AI today and then share tips to increase your creativity and augment it with AI and finally convince you that you're an artist. Let's begin with chapter one the pencil.

I wanted to start off with my favorite quote difference for the sake of it in everything because it must be better. We have seen an explosion in AI coding productivity. People are building and shipping more than ever before. However, today I believe we are offloading too much of our thinking onto AI. We forget that it's just a tool like a pencil, a magic pencil.

The problem is that AI is trained on everything that already exists. When you ask it to design your website, it returns the weighted average, the most common patterns for the most common sites. Speed of execution is driving everything uh sorry is driving the quality of everything to average out to be good enough. The gap between generated and crafted becomes the only gap that matters.

My question to you is, is good enough how much your customers mean to you? I believe AI should augment our creative abilities but not replace them. Last year, I augmented my creative abilities by bootstrapping my app in a saturated market with thousands of products doing the exact same thing. Orbit helps you to track personal subscriptions, which is nothing revolutionary.

However, within a year, it had gone to six figures and was featured by Apple three times. As a designer, I had craft and care as my competitive advantage. I embraced building with AI as a tool to assist my creative demands and evolve something to a high bar. I wanted to make a product that does exactly one thing well for a specific niche of people.

AI was my magic pencil, but I was the one in control. The lesson is that tools will always change. The demand for insanely great, well-made things won't. Tools will constantly evolve to solve problems in novel ways. AI has raised the floor, but it hasn't raised the ceiling. We need to decide what to build, why, who for, and then obsess over every detail to make it great.

Let's turn to chapter two, the poster, and talk about how to increase our creativity and then augment it with AI. Your best work is done when you're not working, when you have the space for creative ideas to emerge. On a summer's day, I was relaxing in my apartment and I saw an interesting interface opportunity on the wall. I love this poster.

It's a mid-century modern abstract art in the style of Matisa's paper cutouts. It's brutally simple. You can count all of the visual elements that make it up on one hand. The fun part of this is that Orbit wasn't inspired by other apps. It was inspired by this poster. I saw this as an opportunity to highlight information in Orbit to help people save money.

By being insanely simple, you'll not only distinguish yourself away from other apps, you'll make it easy for people to understand. Being different gives you a clear advantage to your competitors, it makes you stand out in a sea of generated sameness. This isn't something you can just prompt once as there's not enough training on it. A problem is that we're never bored.

We need to use creative thinking tool calls like walking with no headphones or staring out of the window like I used to do as a bored ' 90s child with no phone. Essentially opening up the chance to give our brains fresh patterns of information. I believe that creativity is for everyone, not just designers. Great ideas start with curiosity and a sense of wonder.

Today, we need more people than ever to take their ideas, daydreams, obsessions, fleeting thoughts, or unique perspectives and turn them into something real. Now, moving on to something more practical. In my design process today, I like to build my own prototyping tools for almost every project. In this example, I actually created a bespoke new shader tool to help me with the intro slide of this presentation.

It allowed me to explore, tweak, perfect, and augment my creative abilities to a level not possible before. Building your own tools, especially during prototyping, is a great way to explore rich behaviors in the experience. This is a hypothetical demo of a debug panel similar to the ones that I regularly use at work. I like adding buttons, toggles, sliders like this data richness control to simulate different product states from an empty experience on day one to a fully populated experience weeks later.

You can jump between screens, reset states, and connect feature flags to quickly test ideas and edge cases. What this really unlocks is the ability to care deeply about the craft of the product. AI has made simulation and iteration dramatically faster, giving us more energy to stay creative and in flow. In a more personal example, I have an open claw that I like to call Flubbot.

On the left, I'm voice dictating whilst I'm walking in the sunshine, letting my mind roam free about this book that I'm writing on creativity. Here, I'm using AI as an assistant to help me organize my book research and then push it to a git repo. Another cool example of using my personal agent is bringing my quick ideas to life and generating fast prototypes. Most of my ideas get added to Apple notes and then end up dying in the ideas graveyard.

But this is a quite a nice way to try them out and see if something's there. This example is a terrible looking prototype, but it's a gift for creative momentum. I wanted to see whether it was possible to track real creative battery as a percentage. I'll usually describe my idea to Flebot in precise detail, maybe throwing in some native iOS specifics like utilizing the screen time API and then I'll go home to my laptop later with a PR waiting for me and build it onto my phone from Xcode.

For this talk, I even asked Claude to create a way for me to navigate my book material from Git so I could build an ideas and themes around AI and design. I even asked it to create a spatial view. I wanted a way to stumble across information in fun ways that might help me to see patterns. I wouldn't have seen reading it linearly.

The overall lesson is by taking lateral inspiration and building personal tools around your work, you can unlock unlimited creativity by util utilizing AI as a tool to augment your thinking but not do it for you. And this brings us to our final act. It's time to convince you that you're an artist. I love this quote from the founder of Doist.

The best products are made by people who put a piece of themselves into the work. The worst products feel soulless. AI has made it super easy to create soulless things at scale. But it doesn't have to be this way. One of the biggest mistakes I see in AI today is people never iterating from the first prompt.

The first version of anything is never great, but the iterated version can be. I made this app icon in just over an hour. Sat in a cafe in London drinking some good coffee. The difference is between good and great is not being attached to version one, but being excited about what version 10 could be. The second biggest problem I see when building with AI today is how easy it is for people to keep adding new things and bloating products with unnecessary features.

Here's a funny example of what I thought an early wireframe for Orbits subscription detail page could look like. Great products are tailored for a small amount of people and real simplicity is extremely difficult. It requires removing everything that is clutter or unnecessary until you're left with the essence of what's important for that niche. My colleague and friend Amir articulates this perfectly that it's now about the craft.

For years, software engineering was mostly about learning frameworks and writing code. Most of our time went into how to build, not what to build. That has flipped. You can now spend months with the big team building the wrong thing and no amount of AGI will save you. To craft things to exceptional standards, we must iterate, subtract, care, and raise the bar.

We must ignore our titles, the things that put us into a box and give us a label. We must think of ourselves as artists so that we can see beyond the status quo, ignore it, and then build something worth making. So, I'll leave you with this. AI is a magic pencil. It's time to follow your curiosity and pour yourself into a piece of art.

What will you imagine Singapore? Thank you. Thank you, Josh. That was fantastic. All right, everyone. So, this morning we've spent a lot of time speaking about personal agents. Up next, we're going to have Sam from Mastra, CEO, founder of Mastra, coming here to talk a bit about agents in production for businesses.

Can you tell them to make the bigger? The bottom right screen. It needs to be the bottom right needs to be. Yeah, the adjusting. There we go.

Production agent patterns across customer, internal, and developer workflows — lessons from thousands of Mastra teams

Sam Bhagwat, Mastra (co-founder — TypeScript agent framework)

Summary

  • Sam Bhagwat (co-founder of Mastra, the TypeScript agent framework; previously co-founded Gatsby) distills 18 months of working with thousands of teams shipping agents into a three-bucket taxonomy: customer-facing, internal, and developer-platform agents — and argues this is 'the beginning of the decade of agents,' not just the year.
  • Customer-facing pattern: instead of letting users CSV-dump data into ChatGPT/Claude, bake the agent inside the SaaS web/mobile app so it can pull richer context from the platform. Canonical example: Indeed's career counselor agent, which marries user-side data (resume, dreams, aspirations) with platform-side data (jobs, salaries) — 'magic' lives at that join.
  • Customer-facing dominant failure mode = cost and accuracy. Early rollouts surface specific users who cost hundreds to thousands of dollars in tokens. Teams burn cycles tuning model-choice knobs and arguing about pass-through pricing — raw token passthrough vs credit systems vs absorption. Plan the unit economics before launch, not after.
  • Speed-to-ship lesson: teams that ship fastest are the ones who've shipped agents before — they 'speed-run the idea maze.' Cited example: a Mastra customer whose lead engineer came from DeepMind shipped fastest. Implication: pick a framework (Mastra's pitch) so you spend time on UX, not on rebuilding primitives. 'We're engineers, we know how to reinvent the wheel. My advice: don't.'
  • Internal agents (banks, finance, healthcare, insurance, gov): three sub-patterns. (1) Enterprise search across the many-systems sprawl with custom connectors per system. (2) Process automation for paperwork-heavy flows — doctors completing clinical trial paperwork, government RFP processes. (3) Generally wherever there are forms and data entry, there is an agent.
  • Internal-org political reality: there's a disconnect between leadership decks and engineers on the ground. Sam's advice for builders trapped in slow orgs is to 'go off book' — embed with an adjacent team that has a real pain point, prototype, iterate. Don't wait for the right project to be handed to you; go find it.
  • Developer-platform agents (orgs with 50+, 200+ engineers): the signal is large machine-data feeds. Examples: AI SRE inside a Fortune 500 NOC triaging incoming alert volumes; a $30B developer-platform company running agents over terabytes of CI logs. Heuristic: if early-2010s 'three Vs of data' (volume, variety, velocity) would have flagged a system, an agent belongs there now.
  • Internal agent platforms = the 'blessed path' pattern. Platform engineering teams wrap a framework (the example given was a light wrapper around Mastra called 'Sage') with company-specific deployment paradigms, then expose it as the approved way for other internal teams to ship agents — directing energy into an approved channel rather than scattered prototypes.
  • Builder-as-user closing point: developer-platform / DevOps-area agents have the tightest feedback loop because you ARE the user. Quick to assess whether the agent is solving real problems and getting better. Strong constraint for early agent work.
  • Closing instruction: 'Walk into work tomorrow with one or two or three agent ideas. Go forth and build agents.'

Slides (2)

Slide notes

  • Segment start 3:07:04 (t=11224s). This was a demo / speaker-cam-heavy talk — only 2 slide frames captured, neither carrying substantive deck content. [t=3:07:04] slide_01.jpg: speaker-cam frame, Sam mid-stage holding his laptop (LlamaIndex sticker visible — likely a borrowed machine), no slide projected behind him; matches his early clicker fumbling ('I think I don't have a clicker... Got the clicker'). [t=3:25:50] slide_02.jpg: closing/transition frame — stage view showing only the event branding ('AI Engineer SINGAPORE' title card and the freestanding 'AIE' letters), Sam and the host on stage as Pierre from LlamaIndex is invited up next. No substantive slide content was projected during the segment, so the row's depth is carried entirely by transcript — Sam ran this as a stand-and-deliver taxonomy talk with verbal examples (HR SaaS, Indeed, Fortune 500 NOC, $30B dev-platform co, 'Sage' internal wrapper) rather than slideware.

Full transcript (2715 words)

Hey everyone, I'm Sam. Uh I'm the founder of MSRA co-founder uh the TypeScript agent framework. Um and before this uh I co-ounded Gatsby, the popular React web framework. Um before that uh I was an engineer at a few startups around the valley. Uh so funny story um 36 hours before I was supposed to hop on my flight um my uh I realized that my uh passport needed to be renewed.

And so I drove like two hours to uh the nearest passport office and luckily they got it back to me same day and I can come here and and be with all you guys. So really excited to be in Singapore, really excited to be here. Um uh thanks all for for being here. So today we're going to talk about uh production agents.

But first questions. Um who here uh is a developer? Um cool. Um next question. Um, who here uh has um built a and shipped an agent into production? Awesome. Um, I'm going to need my clicker. I think I don't have a clicker. Where's the clicker here? There we go. Got the clicker. Excellent. Um, cool. U, so who here has shipped an agent but not into production?

Okay, so we had maybe about like 20% of people say yes to the first question and another uh 10 20% of people uh say yes to the second question. Okay. Um so over the last 18 months we've gotten to know thousands of teams building uh agents with MRA and um I want to share sort of some of the lessons from those teams so you kind of can be prepared to build them yourself. Uh the biggest thing is just a taxonomy of of the agents that we see teams building and it really comes down to three kinds of agents.

Uh that's customerf facing agents, internal agents and developer platform agents. Um and I want to share uh a little bit about uh each one um uh now. So clicker we're trying we're trying here. Let's see if we can get this thing working. Can we get the next slide please? Thanks. Um yeah great. So let's start with customerf facing agents.

Um so there's a couple of interesting um customerf facing agents here. Um uh working on this uh can we am I just pointing it the wrong direction here? Here we go. Um so first question um who here works in a userfacing product team? Uh so could be at a you know software company could be uh a userfacing um part of a larger institution but uh userfacing software teams.

Okay. So like a few hands not like a lot of hands. Um but the interesting thing about these kinds of um about these kinds of teams is that uh you sort of when you have direct uh ability to um sort of shape user experiences uh you can do really interesting things and I'm going to talk about a couple of them is is guys here we Um so uh I'll give an example of a um I'll I'll give an example of a SAS application um that we've seen. So an HR software application.

Um if you're if you're trying to empower your users to use AI in their sort of daily lives, there's really two paths that they could go down here. So path number one is your users are taking um their data from your system. They're doing some sort of CSV dump let's say of you know employees and salary data or whatever and they're pasting it into claude or or chat JPT and they're asking questions about it. Um now the second one is that you you as an HR software company um build a uh agent inside the web app inside the mobile app uh so that your users can now interact with their their data in a sort of more you know meaningful way and and and and the reason that that's the second is sort of better than the first.

There's kind of like user engagement, context engineering. Um, you're going to have more of the whole picture if you're able to pull in other parts of information in the system. Um, and so that's why we see teams kind of building these, you know, inapp um, inapp assistance. And it's not just sort of B2B SAS applications, but it's also kind of like BTOC uh applications where the really interesting thing here is being able to create personalized experiences over proprietary data.

Um now I'll give an example from a a a user and a company that we've worked with a lot which is Indeed. So Indeed is has built a career counselor agent. Um you can imagine that uh you know if you're trying to help somebody you know navigate their career there's really two important interesting data sets. One is your users their dreams and aspirations their background their resume.

The second is um your platform and the you know job data that you have and the salary data that you have and uh the different you know types of proprietary data. And so when you're able to sort of marry those two things together, that's when we've seen teams be able to create some really magical uh user experiences. Um but no matter what the use case is, there are some um common sort of sets of challenges that we see. Um the biggest ones are around cost optimization and and accuracy for for userf facing applications.

Um, when teams are kind of doing early rollouts, what they'll often discover is that there are specific users that may cost them hundreds or even thousands of dollars to service in in token charges, right? Um, and so, um, they spend a little bit bit of time, they spend a decent amount of time trying to tune these like cost and, um, you know, accuracy knobs around model choice, etc. Uh, and they're also sort of trying to trying to figure out like, hey, how do we pass on the co the cost? Should we do some credit system?

Maybe we should do u, you know, specific maybe we should just pass on the uh the tokens the raw token costs uh instead, right? But it requires a little bit of thought and and here's a kind of a um four different teams that we've seen and and number I I'll share some lessons. Of number one um all of the teams that ship the fastest are the teams and this is maybe ob obvious but also a little paradoxical right are the teams that have built agents before um and it's because they can speedrun the idea maze of what you need to build. Um you'll see that the the team that's kind of um that that sort of shipped an agent into production fastest had actually built an the the lead engineer there um came from uh DeepMind and so he uh so so he came to council and you know the team was able to ship fairly quickly.

Um obviously most that's not a um advantage that uh most folks have. Um and but that's actually why um and that's one of the biggest reasons uh we we advocate that folks use a great kind of like agent framework like MRA is that when you're building agents uh there you have the kind of primitives and then you have uh your user experience and the the more time that you spend on the primitives the less time you have to spend on your user experience or the you know the project just takes longer if you have to build both of them. If you can um reinvent the wheel, absolutely. We're engineers.

We know how to reinvent the wheel. We've reinvented many wheels in the past. But my general advice for you is is don't. Um it will save you time and hassle and and headaches. Um uh so so um let's kind of shift now from customerf facing agents um to internal agents and the um so question for for folks here.

Who here works um who who here works at a sort of large institution um maybe something that's not inherently a technology company but you know banks, finance, healthcare that you know insurance raise of hands. Okay. Yeah, a decent amount of hands. Um so with these types of um institutions like what we typically see is there's there tend to be a lot of um uh paperwork processes that are kind of around that um and so I'll walk through sort of a couple different types of agents that that we see people building here.

So the first um the first is sort of like internal enterprise search. Um so you can imagine that if you have tens of thousands or 100 thousand employees uh one of the key things um that you end up thinking a lot about is how do I make sure that um all the information that we have stored somewhere in in one of our many many systems where information is stored is available and accessible and our our like employees know how to find the this information. And so we see um we see people building these kind of agentic search uh type capabilities in house and um you know making them available to every single employee at their at their company. Um and you know building the connectors for each of the systems uh that they're they're working with.

Um uh we've also seen um you know in terms of internal agents a lot of process automation where uh people are you can imagine doctors like completing clinical trial paperwork faster or automating like RFP processes in in government. Um wherever there's a lot of kind of paper and data entry, we see teams, you know, building agents to kind of solve uh solve this. Um the the challenge though is that, you know, if if you sort of work in this or these types of organizations, you're pretty aware that there is often a disconnect between, you know, the leadership and the engineers on the ground. Um, and so if if you work in one of these organizations and you're trying to, you know, bring agents into your organization, what what I would advocate and what we've seen work is kind of going a little bit off book.

Um, you know, maybe that's finding a team outside of yours that needs help um embedding with them, you know, prototyping, iterating. You may not be handed the right project to work on, but you can kind of go out and and find it. And so my um my advice to you and again based on what we've seen is just be a little creative about um about identifying some pain points. There's they're surely there that you can kind of solve and and build agents for.

Um now the third type of uh agents that we see teams building are in the dev platform kind of area of of the stack of you know of the enterprise of the institution. Um we keep and we've kept hearing over the last you know few months from teams that were telling us about different types of infrastructure problems that they were solving with agents. Um these are the types of problems that you see in engine with in organizations with more than 50 engineers, more than 200 engineers, larger types of organizations. Um, you know, there's a there's a team inside a network operations center at a a Fortune 500 um company that was building an AIS SRE to triage these huge volumes of incoming alerts, right?

Um there was another team inside a uh $30 billion developer platform company that was um building agents to uh sort of go through their uh CI logs uh terabytes and terabytes of of CI logs. And and the commonality here, right, is the the commonality is whenever you have a a feed of huge volumes of of machine data, um there is an opportunity to to build agents to solve it. If any of you have you know remember the you know three V's of data variety volume etc right like velocity like just anything that would have triggered that in you know the early mid2010s that sort of like flag look for those parts of your organization um if you're in or nearby those parts there are almost certainly agents to be built and kind of cool projects and meaningful things to work on there that will solve um and and sort of like do real um do do real good and do real help other folks inside the organization. Um the the last um the last kind of use case that I'll talk about is um developer platform uh agents and and what internal agent platforms specifically um and you know what what I mean by that is that there are you know platform engineering teams uh inside many companies that are um trying to empower the the developers inside to build agents and and so they will sort of um for example took um Ma sort of put this light wrapper around it um that had a lot of like you know company specific stuff around their specific deployment paradigms and etc.

Um and they called it Sage and then they rolled it out as a as an internal um agent uh platform to empower other you know it's it's basically a blessed path um for other teams to to build agents and you know you're kind of if if you're around these teams or if you're on these teams you know the the nice thing about doing this is that people want to know where to start and and by making a blessed path for them you can you know you you can make the sort of focus their energy towards the right way uh or like a way that they know is going to be approved uh and you know that they can move forward with building. Um the nice thing about all of these types of projects is that um if you are building for yourself um and you're building in sort of the developer platform infra sort of like DevOps type areas of your organization um you get this very nice tight feedback loop and you're able to assess very quickly like hey is this you know solving a real problem? Is it is my agent getting better? Um is it able to do more things?

Because you yourself are your user. Um and that's in some ways like that's always a nice constraint to have. Um uh and uh uh you know so so this is I think one of the most exciting times as in 15 plus years as a technologist that I've ever had um to to build. Right. There's more interesting things you can do that other people have not yet done.

Um we have these incredibly powerful models that we can point at a variety of like very real um problems. Um this is not just the year of agents. This is the beginning of the decade of agents and I I hope you are able to you know walk into into work tomorrow it and have a sense of here is an agent or or another or maybe two or three ideas uh of what you can build. So um go forth and and build agents uh is is my kind of instruction for for all of you.

Um, it's great to be here and thanks for having me. &gt;&gt; Thank you so much, Sam. All right, everyone. Along these same lines of putting things into production, uh, very happy to invite Pierre up on stage. Pierre is founding engineer at Llama Index and he'll be talking to you about uh, the lessons learned from uh, deploying Llama Parse at Internet Scale.

Where is it? You didn't get the display. Here's what I'm going to do. I don't know why. Can you let me like put this kid?

Pierre-Loic Doulcet (LlamaIndex): LlamaParse failure modes, whitespace loops, and parsing at internet scale

Pierre-Loic Doulcet, Founding AI Engineer at LlamaIndex (open-source document-AI framework; has processed 1B+ documents in production with per-doc agentic loops)

Summary

  • LlamaIndex has parsed 1B+ documents in production via agentic loops that orchestrate OCR + VLM + LLM per element with validation passes — 'we solved parsing with agents' is the framing, and the long tail is what the agent-in-the-loop earns its keep on.
  • Failure mode #1 — The whitespace loop: ~0.5-1% of LLM/VLM calls return infinite repeated whitespace. Anthropic Sonnet family is the worst offender. You CANNOT use space as a stop token because tokenizers pack 1-128 spaces into single tokens and most providers reject space as a stop sequence. Burns full 120k token budgets on Opus for literally nothing.
  • Fix for output loops: stream the response, run mid-stream heuristics for repetition, kill the stream when the pattern fires, retry with different model / prompt / temperature. Do NOT use max_tokens as the cap — too low truncates legit transcriptions mid-page, too high still burns budget. Thinking-trace loops are now worse because providers stopped streaming thinking traces.
  • Failure mode #2 — Model blindness: if the same string repeats at two locations in a chunk, content in the middle gets silently ignored. Every model has this; not the same content; cannot prompt around it (architectural attention issue). Mitigation: switch model or run signal fusion (OCR bag-of-words + layout model to validate expected elements are present).
  • Failure mode #3 — Color blindness: vision models are blind to certain color shades, particularly reds. Comes from how the model tokenizes pixels; profile differs per model. Charts with low-contrast palettes (e.g., 2015 vs 2024 bars in Gemini) are literally invisible. Mitigation: profile each model's blind palette, pre-process for contrast, or apply quadratic color-space shift.
  • Failure mode #4 — Empty-string hallucination: if a tool fails and your prompt template ships an empty content block, the model silently switches task and hallucinates. Anthropic Sonnet 'really really likes' outputting Delaware incorporation documents on empty input. Pre-validate non-empty inputs before calling, and filter known per-model favorite hallucinations.
  • Production reality: every frontier provider has outages almost daily. You MUST build multi-provider, multi-family fallbacks and treat per-family code paths as model-specific — Anthropic-down means failover to Gemini stays your service alive.
  • Evals are the only control loop at scale (because everyone is now vibe-coding agents). LlamaIndex open-sourced ParseBench (official leaderboard on Kaggle + HuggingFace) for doc-parsing evals, and LiteParse — a non-LLM, ~500 pages/sec CPU fallback for when the agentic path fails or you need deterministic real-time output.

Slides (12)

Slide notes

  • [t=3:25:50] Slide 1 — Title: 'Document Processing is Hard' / 'Pierre-Loic Doulcet, Founding AI Engineer, LlamaIndex / Shipping Agents at Scale: Hard Lessons from LlamaParse'. Bullets: '90% of the world's data lies in unstructured documents like PDFs, PowerPoints, Word, Excel'; 'These documents are not natively machine readable and require machine learning to extract information'. Visual: 'What You See vs What the PDF Stores' — left shows clean text 'The quick brown fox jumps over the lazy dog'; right shows raw PDF op codes ('BT /F0 12 Tf 10.0 1172.740 Tm...').
  • [t=3:27:00] Slide 2 — 'We solved parsing with agents'. 'We orchestrate OCR + VLM + LLM per element, with validation loops.' 'Having an Agent in the loop allow us to handle the long tail.' Visual: messy doc -> llama agent -> structured doc.
  • [t=3:28:30] Slide 3 — 'The whitespace loop'. 'The model start emitting whitespaces. Never stop. You CANNOT set space as a stop token.' Terminal screenshot showing '...the quarterly results' then '[...continues until max_tokens...]'. Bullets: 'Seen across Anthropic, OpenAI, Gemini and open weights models'; 'Sonnet family particularly prone to it'; 'Burns your entire token budget on nothing'.
  • [t=3:30:30] Slide 4 — 'Loops: Stream + detect, Do not cap tokens.' Bullets: 'Stream the output'; 'Detect repetition mid-stream'; 'Kill the stream when the pattern fires'; 'Retry with a different model or prompt or temperature'. Caveat: 'works well for output loops. Thinking-trace loops are harder as the model generally loops with variation, which is hostile to naive pattern detection.' 'Why max_tokens is the wrong tool?' — 'Too low: real transcriptions get truncated mid-page'; 'Too high: loops still burn your budget'.
  • [t=3:32:30] Slide 5 — 'Color blindness'. Bullets: 'Vision models cannot distinguish some color shades, certain reds in particular'; 'If your chart uses low-contrast palettes, the model literally cannot see the data'; 'Hard to fix. You may need to use some pre-processing for contrast, or randomly shifting the color space.' Visual: two near-identical bar charts (2015 vs 2024) where Gemini cannot distinguish bars unless a quadratic shift is applied. Right: solid red panel with 'invisible' vs 'visible' labels — 'We test it for each model by using color scale images'.
  • [t=3:34:30] Slide 6 — 'Blindness: Use multiple signal fusion to detect them'. Bullets: 'Use extracted word from traditional OCR or extracted from the doc directly to validate that the output content is consistent with the input (bag of words, ...)'; 'Use a layout model to validate that expected element are on the page'; 'Profile original content colors to detect color that are near each other'. Caveat: 'Sometime some model just have blind spot you can not get around by prompting. eg: Color blindness often come from the way model tokenize colors. The only way around may be to switch model for the job.'
  • [t=3:36:00] Slide 7 — 'Hallucinating on empty string'. Bullets: 'Model really hate empty input/instruction and will resurgitate training data'; 'Happen also in case where prompt have a placeholder waiting for user input'. Code-style box: sample Sonnet output — 'The address of the Corporation's registered office in the State of Delaware is 1209 Orange Street, in the City of Wilmington, County of New Castle, 19801. The name of its registered agent at such address is The Corporation Trust Company. [truncated]'. Caption: 'Anthropic Sonnet models really like to output Delaware registration statement!' Right column: 'Try to avoid sending empty instruction to model'; 'Each models have their favorite output for when no input is provided. You can try to filter them out. Find them by trying empty prompts.'
  • [t=3:38:00] Slide 8 — 'For when agentic parsing fail: LiteParse — Fast lightweight document processing for realtime agents'. Feature grid: 'Spatial Text Output' (text is projected into ASCII to resemble the original input); 'All Major Formats' (supports most major formats by converting to PDF before processing); 'Bounding box outputs' (multiple output format helps enable visual citations); 'Fast Local Parsing' (by default, all processing runs on your machine only); 'Multilingual Support' (support most languages); 'OCR-Agnostic' (plug in the latest and greatest OCR models to improve accuracy). Right: GitHub repo screenshot for run-llama/liteparse.
  • [t=3:39:30] Slide 9 — 'Thank You!' Three QR codes: LlamaParse, GitHub: ParseBench, GitHub: LiteParse. LlamaIndex logo. [Slides 10-12 are speaker-cam / stage-handoff frames showing AIE Singapore stage and the transition to the next speaker (Junu from Tusk on guardrails). No content text — drop from substantive notes.]

Full transcript (1579 words)

Pierre. Um I'm at Lam Index and today I want to explain a little bit about what we learn when we chip uh agent at scale over the last two years uh at Lama index. Um so for those of you who don't know uh Lama index um it's originally an open source company open source framework uh and we focus currently on document AI and over the last two year we processed over more than a billion documents uh in production each of them with their own agentic loop. Yeah.

So one of the core problem we are trying to solve today at lind index is document processing. Um if you have already tried to extract data or to send a PDF to an agent uh you maybe have realize that PDFs themselves are very hard to parse and contain a lot of garbage content um because they basically uh don't contain structure content but they contain uh bonding box of word on the page. Uh and you have to somehow um reconstruct that uh into something into something usable. Um so since 2024 uh early 2024 uh we try to solve this problem by building agentic system leveraging LLM originally uh vision language model and OCR and a lot of other techniques and models uh together into an agentic loop with the goal of trying to solve this kind of document parsing issue uh and to be able to handle any kind of uh documents um TLDDR um we are using agent in production to pass documents uh and so far we pass like I said like billions of documents um and the goal of this talk will be to introduce a few of the things we see that breaks often in production but that don't get speak that much about uh one of the first many issue you have when you work with LLM or VLM u they really like to loop on the output.

Uh so a few percentage of your query maybe 1% or.5% uh that you sense to the large language model will come back as repeated output uh and will totally broke uh your workflow. Um one of the worst offender of that is the white space loop. Um especially for example the entropic sonic class was extremely sensible to that. Um and the model will output infinite uh spaces um in the output um and we just use uh all your token budget and you have no way to control it uh because because of the way tokenizer work space is the only character you cannot put uh in a stop sequence because most frontier model or open weight model um have token for one space to space up to 128 space most of the time.

So um it's very difficult uh to put space as a stop second. So it's a character if you only put space uh most provider will reject your query or most model will reject your query. Um as space token cannot be set as a stop token. So what you have to do to handle this kind of loop in production um basically you need to always use trimming to your model.

You should not use patching. Um and you need to for every chunk that come in from the model provider or from your model inference. Um you need midstream to run some aristics to detect is there some repetition happening and you need to try to kill as early as possible uh the query uh so you don't end up like spending uh 120,000 token uh on opus just for white space it can get like costly very very fast um so generally what we do you can kill the stream uh and then you retry with a different uh model or with a different prompt or with a different temperature and you hope that you will not be again in this loop. Um this generally work well for output loops.

Um it's harder and harder now with syncing loop on syncing trace especially as model provider don't stream anymore uh the syncing trace for you. Um so here you will have to rely on max tokens to limit the span. Uh but it's not really the good tool uh for the job uh because if your max token is too low then maybe you don't get the output you want. If it is too high uh you are burning way more budget uh on syncing loops.

So yeah loops it's a it's a huge issue uh and you you have to design around it. Another issue we see uh is model blindness. Uh model are generally blind to some content. Um one common issue we see in transcription is like if the your content or your chunk in a rag system have a repeated string. So you have the same string that repeats at two place uh in the original content.

What is in the middle sometime get totally ignored by model it vary by model. All all model have this issue. Uh we haven't find a model that is perfect on it yet. Um they are not blind to the same thing. So you can still switch model. Um but yeah it's um you cannot prompt your way around it. Like if you have a Germany call that is blind for some content between two string you can try to modify your prompt as much as you want.

Uh the model is literally blind due to the attention architecture. Um another issue uh we see in term of blindness is color blindness. A lot of vision model u are blind especially uh in some kind especially in the red uh space. Um as human we are very good at distinguish between different red uh and once again due to the way uh they tokenize picture uh and image.

Um you have color blindness in the model and the color blindness profile uh is not homogeneous between model. Uh so basically you have to test every model to understand uh color blindness. Um to detect if your model was blind or not to to something um first thing to try to analyze like what what is the color profile the model you are using is blind to. Uh and the other things you can do is try to run an OCR for example on on an image before sending it to the model and see if the model have catch the words that were on the OCR.

Uh you need to do some kind of signal fusion uh to move around. Um other things that break very frequently um is if you have a prompt with a template somewhere and for some reason a tool fail or whatever and you send an empty content uh then the model will just f will not tell you the content is empty. It will just change the task to a task where it will hallucinate uh the content for you. Um some model have a tendency of always hallucinating the same things like entropic really really like uh deawware um incorporation document for some reason.

So you can try to to filter it using some kind of aristic. Um but yeah similarly to to blindness uh you could use also some kind of mix in your things. Uh or you could before calling the model try to make sure you're not sending a blank image uh or a blank template uh inside the prompt um so the model doesn't elucinate. Uh and lastly in production like one of our biggest issue is the current state of things.

Um every model provider have issue scaling these day. So API are done almost daily. Um so basically you need in your agentic system to have you need to build them to support multiple provider multiple family of model. Uh you need to treat the code for every family of model as specific code for the model.

Uh because um yeah because every model uh behave differently uh and this allow you uh when entropic is done uh you can fall back to Germany or something like that. It allow you to keep your service live even if your API provider or your model provider is done. Um and lastly um you need to build good evals uh because we v code or we use coding agent more and more and basically the only way you can control at scale the behavior of your agents to have good evals um and if you're looking for evals for document parsing uh we build uh passbench um which is open source um and is running as an official leaderboard on kegel and face Um and when agentic fail you need to have a fallback to something that is not using LLM. Uh for that we build light pass.

It's also open source and it do around 500 page a second on on CPU. Um and basically you need to have a fallback for when the LLM will fall uh and when you need to do something without using a model. Um thank you. Um there thank you Pierre. All right, everyone. Just one more talk um between you and lunch.

For our final speaker of the morning session, we have Junu from Tusk who will be talking about how to elicit more secure and reliable behavior from agents through guardrails.

Fence: OS-level guardrails and deterministic safety for coding agents

Jun Yu Tan, Founding Engineer at Tusk (Singapore; building Tusk Drift API replay and the open-source Fence sandbox primitive)

Summary

  • Jun Yu Tan (Tusk) reframes coding-agent safety as a SQL-injection-style problem: you don't fix it by training the actor (developer or agent) to behave — you move enforcement below the layer where mistakes happen. Prepared statements killed SQLi; OS-level execution boundaries should kill agent overreach.
  • Pulled his own Cursor telemetry: 110 sessions over 6 weeks, median 40 tool calls, mean 120, max 1,076. Per-call permission prompts don't scale — fatigue forces users into 'dangerously-skip-permissions' / yolo mode, which leads to agents nuking home directories and wiping entire Macs (screenshots of the 'Claude Code deleted my entire home directory' and 'wiped my whole mac' Reddit/X posts as exhibits).
  • Probabilistic guardrails (Anthropic's auto-mode classifier) have a structural ceiling: a 99% reliable classifier across a 120-call session has only ~30% odds of zero mistakes; across a 1,000-call session it collapses to essentially zero. Anthropic's own guidance is to run auto-mode in isolated environments — Tusk is filling that gap.
  • Introduces FENCE — a deterministic, OS-level execution boundary with near-zero overhead, open-sourced as Use-Tusk/fence. No daemon, no image, no container runtime. Built originally for Tusk Drift (their API test-replay system that replays prod traces in CI without side effects), then generalized as a primitive any agent app can sit on top of.
  • Fence enforces three things via a single policy file: (1) filesystem — uses bubblewrap on Linux and sandbox-exec on macOS to protect ~/.aws, ~/.ssh, ~/.env, etc.; file contents are out of reach. (2) network — forces traffic through HTTPS + SOCKS5 proxies (netns on Linux, Seatbelt on macOS); only allowlisted domains (npm, PyPI, GitHub, …) resolve; everything else 403s. (3) commands — preflight parse handles &&, ||, ;, pipes, nested sh -c; path-based runtime deny on git push, rm -rf /, npm publish, kubectl delete, etc.
  • Demo: Fence config blocks a directory and outbound endpoint — scripts that read those files or hit that domain fail under fence and succeed outside it. Agent is then asked to update a README; the file edit succeeds, but git push fails because push is denylisted in policy. Boundary held without a single permission prompt.
  • Closing framework: the 'Swiss cheese model for secure agent execution' — three layers stacked so their holes don't line up. (1) Classification ('is this reasonable?' — auto-mode, probabilistic). (2) Policy ('is this allowed?' — Fence, deterministic). (3) Isolation ('what can it touch if things go wrong?' — containers, microVMs). Most teams have layers 1 and 3; the missing middle layer is Fence-shaped.
  • Thesis line: 'Stop asking the actor to behave — change what the actor can do. Define the rules, enforce them at the OS, and let the agent run.'

Slides (5)

Slide notes

  • [t=3:39:00] Slide 1 — Twitter/Reddit screenshot collage in three rows: 'Prompts are unusable' (top — prompt fatigue), 'Just yolo' (middle — recommending --dangerously-skip-permissions), 'I got burned' (bottom — 'Claude CLI deleted my entire home directory', 'Wiped my whole mac'). The UX failure mode: prompt fatigue → bypass → overreach.
  • [t=3:40:00] Slide 2 — Dot plot of tool calls per session, n=110 sessions over ~6 weeks from his Cursor transcript. Median 40, mean 120, max 1,076. Demonstrates that per-call human approval cannot scale with session length.
  • [t=3:43:00] Slide 3 — 'App layer / OS layer' architecture diagram. App layer shows coding agents (Anthropic, OpenAI logos) + Tusk Drift + 'Your custom agent?' OS layer below shows 'Sandbox: Deterministic / OS-level execution boundary / Near zero overhead' next to the Use-Tusk/fence GitHub repo card. Fence as the shared primitive underneath all agent apps.
  • [t=3:44:30] Slide 4 — Three-column spec for Fence: Filesystem (bubblewrap on Linux, sandbox-exec on macOS; protects ~/.aws, ~/.ssh, ~/.env; 'File contents are out of reach'). Network (HTTPS + SOCKS5 proxies block at egress; forced via netns on Linux, Seatbelt on macOS; allows npm, PyPI, GitHub; others 403). Commands (preflight parse handles &&, ||, ;, pipes, nested sh -c; path-based runtime deny on git push, rm -rf /, npm publish, kubectl delete).
  • [t=3:46:30] Slide 5 — 'The Swiss cheese model for secure agent execution.' Three stacked cheese slices: Classification (auto-mode, 'Is this action reasonable?') / Policy (Fence — OS sandbox, 'Is this action allowed?') / Isolation (containers, microVMs, 'What can it touch?'). Threats on the left — jailbreaks, prompt injection, overeager agents, mistakes, destructive actions — arrows show some slipping through individual layers but stopped when the three are stacked. Actions executed on the right. Note: only 5 slides for a 10-minute talk — heavily demo-driven (live Fence demo of blocked file access, blocked outbound request, and blocked git push).

Full transcript (1772 words)

A foundation at Tusk and today I'll be sharing about execution boundaries for coding agents. Now this is something familiar to every web developer. Um the classic SQL injection shape for a long time. Um for a long time this was how web apps got broken. Um a user controlled u string cross into the cross directly into a SQL interpreter.

Um we didn't fix this by you know training developers to sanitize inputs harder. Uh we solved it with prepared statements right by moving this boundary into the driver. So SQL injection becomes structurally impossible. Now this is a dangerously skip permissions flag. If you have used coding agents for any amount of real work uh you've probably seen this um it exists because permission prompts u well are protecting something real but um they also interrupt the flow of work.

So I've tro through Twitter to see what people think about this flag or permission prompts in general. Um the top row represents uh some kind of prompt fatigue, right? People who haven't gone full yolo mode but are kind of frustrated that they have to approve every single tiny step. Right? The middle row is what happens next.

People turn these prompts off. Um they run the skip permissions flag. They recommend that others do the same as well because they see this as the only usable workflow. The bottom row is the consequence, right? People feel a little bit uneasy about what the agents can do or already have been burnt by you know sometimes agents just like deleting uh costly data or even the entire system.

So this is the UX filler mode right here. Um prompt fatigue um turns into bypass bypass turns bypass turns into overreach with serious consequences. So I've pulled my own cursor transcript from the last six weeks u spanning 110 Asian sessions across um uh the the six weeks. So in my data set the median session uh the the middle session had like 42 calls the average was 120 and for my longest sessions this was uh over a thousand right so the paradigm of asking the human every time um simply doesn't make sense it doesn't scale right as AI can take on bigger and bigger tasks um sessions get longer and longer and many of us will just skip permissions so what we're left with are agents with full access to our file systems our credentials our environment variables and secrets and so on.

So that's not um very secure way of doing things. The industry knows this is broken. So earlier this year, entropic shipped auto mode for clock code. U basically this is a classifier that reveals each two call. So two calls and actions that seem safe and reasonable gets through and gets executed for those that you know seem a bit suspicious and out of context gets blocked, right?

So there's no human to look for like the routine stuff. And this is a great improvement but Entropic's own recommendation is to run it in isolated environments. Um and the reason for that matters. If you look at a math right suppose your classifier is 99% reliable on an average session of about 122 calls the probability that a classifier doesn't make a single mistake in the whole session um is 0.99 to the 120th power or about 30%.

Now in my longest session of more than a thousand tool calls this is essentially zero right so of course a few caveats here errors are not independent uh they're sometimes correlated so don't take these percentages literally u here I mostly want to make um a point that per tool two call probabistic checks have a ceiling that degrades with session length so can we do better right so probabistic checks decays with scale deterministic boundaries hold that scale So that begs the question, what's the right boundary for code that you mostly trust but can't fully verify? And turns out agents are just the latest version of this question. Let's look at how we have solved this before. For SQL injection, as you know, I've introduced earlier, we use prepared statements and ORMS, not just relying on input sanitization.

For memory safety, we now have memory safe languages, not just writing careful C. U for network is dropping we use TLS not just trusting the network. The pattern here is to move enforcement below the layer where mistakes happen. Um and the kind of issues that we are seeing these days as agents get more and more personal and uh open-ended.

I'm calling this agent overreach. Right? The interesting thing here is that there may or may not um be a malicious attacker, right? Unlike those um above. Sometimes agents just execute projection. They hallucinate. They get prom injected. They maybe they run in circles and decide to nuke the whole system. It doesn't matter which one.

So what's the structural fix? So today I say stop asking the actor to behave change what the actor can do. If running clock codeex or any terminal based agent you want something underneath right that enforces certain boundaries and let the agent run within those boundaries. Um and here's the thing we didn't build this just for task drift.

Uh we didn't we didn't build this for coding agents. We built this first for task drift. Uh task drift is our API test replay system where in CI hundreds or even thousands of production traces gets replayed against your app and when that happens we don't want any side effects right we want to guarantee that um there's no say for example a DB calls a live call that goes into a pro DB and affecting state right we we can't afford that to happen so we built a primitive um a deterministic OS level execution boundary with near zero overhead we open source this as fence so and forces the network file system and command policies that you configure. So you can think of fence as this boundary um that we want underneath all of them right one single one single policy vocabulary uh no matter which agent or app is driving the work.

Now fence enforces three things uh file system, network and commands. Files outside of policy are simply out of reach um to the agent. U network calls are forced through local filtering proxies and only allowed domains can be reached and commands are checked before execution. So this also includes uh chains and nested shells.

And this is what a policy looks like, right? It's just one file with a path the agent can see um commands um domains you can reach and commands you can never run and that's it. There's no demon, no image, no container runtime. So here's a quick demo. Uh I think this is running a little fast but I can explain it.

Um so what we have previously is like we have a we have a fence config that basically blocks out um this this directory, right? Um in in this repo. Um so uh we have some scripts as well um that try to access those m files uh and the directory in the home directory that we blocked uh in the fence config. And so when we run these scripts we we couldn't um when we run the scripts outside of fence this works right or we we there's also another script that you know makes outbound requests um to to an endpoint.

Um but in our fans config this is you know we we don't have uh we didn't set any allowed domains. So you know this uh under fans this will fail. So basically um this demo illustrates that when it tries to run those scripts um something fails and now I'm just asking it to know like just update the readm of today's date just make a simple file change um it does that but now um when it you know tries to um create a commit and push the commit to remote this fails because um in our fence config we have um added the get push um as a denied command. So this is in a nutshell how fence works.

All right, let's wrap things up. Um I think about this as the spec sheets model for secure agent execution. So the first layer, okay, so on the left side we have um commands that the agent wants to run. Most of these commands are, you know, safe and reasonable and routine, right? But some of these commands arise could arise due to jailbreaks uh prom injections overeager agents and so on.

So we want to filter out this destructive commands before you run them through these three layers. The first layer is classification. So this is for example like auto mode. Um this asks is this action reasonable? Now this is probabistic uh as we have seen earlier but it can better understand nuance and context.

Second layer is policy and the enforcement of this policy. So this is where fence will sit. Um it's asking is this action allowed right? So if something slips through the cracks for the first layer um as long as it's denied um in a fence config um the the action will get denied will get blocked. Last layer is isolation.

So here we have containers and microVMs um basically asking what can this process touch if things go wrong. So like for example for hostile code or multi-tenant workloads um yeah so that's where containers and microVMs matter when you want to really um increase the distance between a host machine and the Asian workload. Now none of these layers is perfect and the point is to line them up to stack them up uh such that they their holes don't don't line up right so we can achieve defense in depth and most teams already have one of these layers right if you're using cloud code you have probably been on auto mode if you're are security conscious you might already run agents in containers or cloud sandboxes u but what I want more of us to consider is a middle layer defining the boundaries of what your agent can and cannot So stop asking the actor to behave. Let's change what the actor can do.

Define the rules and force them at the OS and let the agent run. Thank you. &gt;&gt; All right. Thank you so much, Chingi. And that is going to be the conclusion

Beyond Brute Force: The Era of Adaptive Intelligence

Sara Hooker, Adaption Labs (founder; ex-Cohere For AI; ML researcher)

Summary

  • Sara Hooker (founder Adaption Labs, ex-head of Cohere For AI, ex-Google Brain) opens with a grumpy thesis: the static, monolithic-model paradigm is dead, and the future of AI is adaptive — the whole stack (data, training, compute allocation, interface) must reshape itself per task, in real time.
  • Frames Rich Sutton's 'Bitter Lesson' as the operating belief that captured the field — your beautiful idea only matters if it scales — and argues the entire ecosystem (GPU-rich/poor jokes, capital flows from academia to industry labs, national compute as policy priority, quarterly-cycle-friendly capex bets) has been reconstructed around it, producing a concentration of power into a handful of providers.
  • Empirical case that scaling is over: same-size models steadily more performant over time; best small models now beat much larger ones; 95%+ weight sparsity post-training works (severe redundancy across weights); high-quality data drastically reduces scale need; latest 3-4x size jumps from frontier labs (GPT-4.5, Llama 4 cited via Nathan Lambert / Interconnects) were 'not servable' and disappointing — most gain from scaling is long-tail rare-artifact learning at terrible compute economics.
  • Diagnosis: transformers are saturated; the rate-of-return is now much higher in post-training, alignment, synthetic-data dynamics, adaptive compute, and hardware co-design than in pre-training parameter count.
  • Three pillars of the new regime: (1) adaptive compute (don't burn same FLOPs on every problem), (2) interaction matters (first time CS has to care about interface), (3) continuous learning for long-horizon tasks. Goal: 'intelligence that continuously evolves', whole stack reshapes by task type, must be incredibly efficient — efficiency is the obsession because every bit of friction pushes users back to being elevated prompt engineers.
  • Adaption Labs pillar 1 — Adaptive Data: shipped 4 weeks ago, covers 242 languages, 27M data points processed. Thesis: now that pre-training scale is over, the remaining frontier-lab moat is data innovation / synthetic-data leverage; data is now cheap enough to optimize per-objective; users should make their data 'fully visible to AI' within a day.
  • Adaption Labs pillar 2 — Continuous Intelligence / 'autoscientist': released 2 days before the talk. Co-optimizes and automates training learning. Claim: trains a frontier model in 2 days, beat their own AI research staff (who are typically anchored to one model family) by sweeping across ~30 models on Together AI inference. Free for the next month.
  • Globality and non-verifiable tasks: 242-language coverage is intentional; most of the world's work is non-verifiable, so the winners will be those who can adapt to non-verifiable tasks rather than the narrow verifiable-RL slice.
  • Closing frame: 'who makes the cost of adaption most efficient' is now the central question — not who has the biggest model. Adaptive intelligence at the edge of the user's task, not centralized monoculture, is the next position.

Slides (15)

Slide notes

  • Segment starts at 5:01:45 (t=18105s) — post-lunch keynote on Day 2.
  • [t=5:01:45] Slide 01: Title — 'Beyond Brute Force: The Era of Adaptive Intelligence'. Sara Hooker, Co-Founder, Adaption Labs. [~5:02] Slide 02: 'Over the last 24h.' — screenshot of a Google Keep reminder ('AIE talk – Saturday, May 16, 20 minutes talk (slides required). Builder-focused talk about why adaptive AI is a huge unlock'), plus four-talk SG schedule (15 May NUS, 17 May AIE, 18 May Lorong AI fireside w/ Leslie Teo) and a photo of an airport apron — Hooker built the deck on the 17-hour flight. [~5:03] Slide 03: '"I need an opening slide that speaks to why we need adaptive intelligence."' — the ChatGPT prompt she used. [~5:03] Slide 04: ChatGPT output — bombastic green chameleon poster: 'The world doesn't stand still. Neither should AI. Change is the new constant. ADAPTIVE INTELLIGENCE: Learns. Adjusts. Anticipates. ADAPT TODAY. LEAD TOMORROW.' Used as comic counter-example to her actual aesthetic. [~5:04] Slide 05: '"Introduce me the speaker Sara Hooker"' — second prompt. [~5:04] Slide 06: ChatGPT-generated intro — 'SPEAKER Sara Hooker. AI STRATEGIST. INNOVATION LEADER. ADAPTIVE INTELLIGENCE ADVOCATE.' with a photo of a completely different woman. Punchline: 'There's only one little problem.' Used as the canonical example of why one-size-fits-all monolithic AI fails real users. [~5:05] Slide 07: 'How do I improve this?' — shows ChatGPT thumbs-up/thumbs-down UI under the wrong-photo slide. Two options: (a) give feedback and maybe a researcher fixes it months later, (b) become an elevated prompt engineer. [~5:06] Slide 08: 'Current AI is monolithic. The last decade has involved scaling the same model + shipping it to everyone.' Two callouts: 'Everyone is expected to become an expert prompt engineer' / 'It is also very inefficient. We spend the same amount of compute on all problems.' [~5:07] Slide 09: 'The Cost of Static Intelligence' — three glassy 3D-rendered shapes: 'Static Models. Living World.' / 'Endless retraining. Endless cost.' / 'One-size-fits-all. Built for no one.' [~5:08] Slide 10: Section header '1 — How did we get here? The limits of scaling.' [~5:10] Slide 11: 'Acquiring necessary compute is now a national priority.' Collage of headlines: AI Singapore, Falcon Mamba 7B, NVIDIA / Japan sovereign AI, Singtel-NVIDIA sovereign SEA, India's $1.24B AI Mission. Frames sovereign-compute race as ecosystem reconstruction around the Bitter Lesson. [~5:11] Slide 12: 'It has resulted in a shift of contributions from academic to industry research.' Scatter plot of parameters in significant ML systems 1950-2022, color-coded by sector — purple industry cluster dominates post-2018. Source: AI Index Report. [~5:13] Slide 13: 'Recent models have required far more compute but have generally not been considered major steps forward.' Screenshots of: Nathan Lambert / Interconnects '"Not a frontier model"? GPT-4.5', 'The End of Scaling: GPT-4.5 and the Looming AI Winter', 'Is OpenAI hitting a wall with huge and expensive GPT-4.5 model?', 'Meta's Llama 4: Revolution or Disappointment?'. [~5:14] Slide 14: 'Rate of return no longer makes sense for pre-training. Return for compute is much higher in post-training, test time scaling, adaptive compute, co-design with hardware.' Glassy 3D blocks of varying sizes — visual metaphor for moving compute budget around. [~5:15] Slide 15: 'This means all bets are off. New era of intelligence will require more than brute force scaling of model size.' Three pillars: 'Efficiency of interaction matters — Models should adapt real time' / 'Size of model does not matter — Instead it is how recipe changes based upon task' / 'Ability to update behavior continuously matters — Last mile advantage.' Adaption Labs framed as 'building the next generation of adaptive intelligence. Evolved for any industry, language or specialization.' [Section 2 header '2 — The era of adaption.' visible at slide 15 transition.] After slide 15 Hooker continues verbally (no further deck slides extracted) into the Adaption Labs product pillars: adaptive data (242 languages, 27M data points, 4 weeks live) and autoscientist (continuous intelligence, frontier model trained in 2 days, beat their own research staff across 30 Together AI models, free for the next month). Close: 'who makes the cost of adaption most efficient' is the only question that matters.

Full transcript (2939 words)

&gt;&gt; Hello everyone. It's super lovely to be here. So, um I'm gonna ask for everyone to stand up. Amazing. Yes. Everyone to stand up and I want to ask you to now stretch upwards to the right to the left and give a high five to the person next to you. Amazing. And now you can sit down. Uh I know that this is uh actually very special because this is day three of the conference and it's just after uh I think many talks but I feel very honored to be here.

So this is super special to be able to share with you what I consider a very grumpy problem. So typically what drives most frontier research I think is a feeling that you're very grumpy about something and something has to change. So today I'm going to be talking about why the future is adaptable. To do that, I want to start by uh kind of what I normally and where should I be pointing this over here or to change the slides.

Should it just click? Maybe I'll Oh, I mean I I can do that, too. I'll I'll do because my pace Yeah. So, I'll stand here. I won't walk as much. Okay. Amazing. So typically when I'm doing new slides, I like to wait until the very last minute because I'm one of those people. I like to think about what are the ideas and what am I thinking about right now.

So um over the last 48 hours, this has been my life. I got a reminder that this talk was I'm actually giving four talks while I'm in uh Singapore and I decided I have a 17-hour flight. I'll do it over the flight, which was very productive to do. So I said, "Hey, why don't I just try to start with asking chat GPT to give me a slide?" So I said, "I need an opening slide that speaks to why we need adaptive intelligence." The result was quite interesting.

I got this back. It's very bombastic. It has a lot of flare. You can see there's a lizard there. It kind of evokes Charles Darwin evolution. Um, and so I said, "Okay, interesting. Not my usual style. Let me ask for it to introduce me. And this for reference is my normal introduction slide. So I was at Google DeepMind for a long time.

I led Coher Labs. A lot of my uh career has been doing publications and doing research at the edge of what's possible. Uh I've considered myself very lucky to be at industry labs that have produced some of the best frontier models at the world. Um but I think that was reduced to this. So there's only one little problem.

Maybe it's notable to some of you. Um and I think that this is pretty much ex an example of what people feel when sometimes they're using AI. So to fix this, I guess I could have given thumbs up, thumbs down. Um and maybe somewhere some researcher would get it a few months later and make a difference. Or I could become an elevated prompt engineer.

So I could get very good at just creating the exact specifications of what I want. And I think this is pretty much the state of AI. So most of my career as a computer scientist, you build the biggest model and you give it as many capabilities. You try and guess what is it going to be used for and then you ship it the same model to as many people in the world.

But this I think most people understand has two issues. One, it means that everyone has to do acrobatics around the model and try and make it work for them. And then the second, it's also very inefficient. We spend the same amount of compute on all the different problems. And I would say that that's really the cost of static intelligence.

So we have built these very powerful models, but they don't continue to evolve. You have endless retraining and then you are one-sizefits-all. So today I said this is going to be a grumpy talk. I'm going to talk about well how do we get here? Why this is the moment in which we should really start understanding where why do we need to scale and also is the future monolithic?

And then I'm going to talk about adaption and some things that we're excited about. So I think this will be fun and you know I think that also I'll ask at the end if I've convinced you. So how do we get here? So how do we get to these big models that ship the same way to everyone? Well I think that for most of my career and actually most of my experience in big labs, it's been a bigger or better.

You basically every year double quadruple the size of your model and it's worked very well. Uh this is captured by Rich Sutton who's a famous computer scientist. He won the touring as the bitter lesson. And actually the bitter lesson is kind of a punch to the ego of every researcher out there. It basically says you might think and you might be attached to your beautiful idea but your beautiful idea only matters if it can be scaled.

And it is interesting because the first question I'm going to pose today is is sudden right? Is the only ingredient for AI progress scaling model size? Put up your hands. Nice. I have a double cross. A double no from Eugene in the second row. Excellent. Who thinks he's right though? Put up your hands. Bravo. Some brave souls.

Excellent. Great. Yes. I mean, he won a touring. There must be something right about what he's saying, right? So, who thinks he's right? Excellent. We have a few more braceles. And in fact, I think there's many reasons to say, hey, the evidence supports that he's right. Because if you look at it, our whole ecosystem has reconstructed around this belief.

We have jokes about GPU rich and poor. We have Michael Jordan, the scientist, not the basketball player, who says, "I can't think without holding a piece of metal." We have basically researchers like me who would traditionally be belong in academia going to industry labs and being given a lot of resources and a lot of money because there's just been this shift and capital influx to work on these ideas because the belief is you need compute and it's determined who doesn't get to participate and who does. It's also a national priority to acquire compute and also it's widely favored. So it's seen as less risky than doing something with an algorithm.

It fits in, it's very handy. It fits into quarterly planning cycles. So, it's easy to justify and people even raise based on amount of compute. So, it's very hard to turn around afterwards and say, "No, we don't need compute after all." And what this means is that it's actually led to a concentration of power.

So, this question is actually very important to ask because it's determined so much. I've put provider company A BC, but if I pulled, you'd probably all come up with the same names, right? And so really it's just meant we have less choices. So is sudden right? It is very controversial still to suggest that scaling is over.

But I'll show and I'll I'll illustrate why in fact I think the relationship between model size and performance is far from certain now. And in fact all bets are off. I would argue. So we now see that AI models of the same size have become steadily more performant over time. And so you can get and squeeze a lot more out of the same size.

But more convincingly, we now see small models outperform much larger. So the best small model is much better than much larger models. So size isn't everything. We see and we known for a while that there are severe redundancies between weights. So if size were all you need, why are so many weights doing the exact same thing?

Why can you predict from a handful of weights what a deep node network does? And if size is everything, how come you can remove most weights after training, how come you can sponsify and remove 95%. All this suggests that while size is important for optimization, in the reality, it means that we're just not good at training better, more performant, smaller models. High quality data drastically reduces the the need for scale.

But more importantly, most of what we gain when we scale is a longtail. So when you double or triple the size your model, you just learn rare artifacts. That's a very expensive way to learn rare artifacts. So even if we can scale, we're paying a lot more. We actually see this candidly in latest models, it doesn't pay to scale anymore.

The latest efforts by Frontier Labs to triple quadruple the size of their models have been seen as not servable and frankly kind of disappointing because they only improve performance on a small edge. So I would say we're hitting the limits of transformers. Transformers were the breakthrough, but they're also saturated. So I would say and this is where you know it's quite fun the rate of return no longer makes sense for scaling and in fact the rate of return is all that matters.

In fact what's fun is that the rate of return for other parts of compute is actually way better. So post- training, alignment, dynamics of data synthesis, um adaptive compute, co-design of hardware and this means that the idea of a few handful providers controlling so much of the dynamics of who gets to provide is very different. The new era of intelligence will require much more than brute force scaling. And I think there's a few ideas here which are very important.

One is adaptive compute. The other is interaction now matters. How does your model interact with the world? So the first time computer scientists have to care about interface and then the third is you need to continuously learn because you're doing much more long horizon tasks. So where are we? I would say we're in the era of adaption.

And I say this because what matters more is how you leverage capacity and it matters more how you learn from your from your real world environment. And this is very different because most of our time as a computer science field has been around the idea that you obsessed with a model and that from the 1950s till now we were focused on how do you build the best model but in fact our optimization spaces in this era where you can't just scale model are all about how do you adapt the whole stack from data all the way through interface and the notion of a system and how it interacts with the world is critical. Our goal is to build intelligence that continuously evolves and we see this entire stack as being important from data through interface. The whole thing should change based on the type of task you have and it should be incredibly efficient and this is kind of a fundamental shift if you think about it.

We're going from the weights and the name of models being everything to in fact like a very fluid stack. So I'll share a little bit uh of what we're excited about and like what's interesting and then I'm super happy to talk afterwards. So one thing is that you know our first pillar is adaptive data and we believe that's important because you can optimize on the fly towards whatever part of the data distribution you care about. Um we're four months in we shared this um a few weeks ago and I think that the goal is to make available what's typically within frontier labs.

Most innovation now that scale is over even in pre-training is data innovation. How do you do really powerful leverage of synthetic data? Uh we also think that this is pretty profound because of the first time data is cheap enough where you can optimize in the data space towards any objective you want and so people should be leveraging and making their data visible to AI. What's very cool is uh it's been super fun to see how people have responded.

So we released it four weeks ago. We covered 242 languages and we processed like 27 million data points already which is crazy. I think part of it is that we're very fast. So you can basically turn around and make your data fully visible to AI within a day. Um and uh our next pillar is just as exciting. So since we see the full stack is important to be adaptable, the next is continuous intelligence.

Um what we released this week was I think you know time blurs with time zone difference but I think it was two days ago we released autoscientists. So this was about how do you co-optimize and automate the learning of training because this is one of the biggest blockers to having adaptable AI. Um and order scientist self-improves and automatically learns what's how to optimize the data and the model to whatever task you want. But what's cool about it is it's very fast.

So you basically can train a frontier model in two days which is pretty absurd. Um we actually did a cheeky experiment. We asked like can this beat our AI research staff? Um and it did much better. I attribute this in part because most AI research staff are trained within a specific frontier lab on a specific family of models.

But we actually tested this across every single available model on together AI which is an inference provider. So there's like 30 different models and there researchers really struggle to figure out automatically how to configure for different architectures and how to co-optimize with the data. So this is pretty cool. Um and it's very predictable games.

Why I say this is I I actually think looking forward the idea is you should be able to automate your entire stack. The vision of really adaptability is efficiency. Adaptability um it's pretty crucial that eventually adaption is real time to whatever task you have. And the more friction you have about that adaption, the more people return to just being prompt engineers.

So for us, efficiency is the primary and obsession of like how we think about um making it meaningful that people have more alternatives to just a monolithic AI. Um so this is this is really fun. I think a lot of our research staff has spent a long time working on this. The only thing I'll I'll say is I think another crucial aspect to adaption is um it should be global first from day one.

So we cover 242 languages and we're most interested in TASA non-verifiable. I think most of the world is actually non-verifiable. There's a very small fraction of of of tasks that are and so this is what matters now this is what will be decided in terms of who can make progress is who is able to leverage those tasks and make it more meaningful so what is the way forward what are my parting thoughts so where do we end up I've hopefully convinced you that this is not the finish line that I should not have to be a master prompt engineer to get things that I want and are relevant to me and um I may have convinced you that we're at the end of scaling and that at least like just doubling the size of your model doesn't work anymore which means that it's fun. It's the era of innovation again.

But regardless of whether I've convinced you of that like probably I've convinced you somewhat that it's very expensive to scale right and that the returns are probably not worth it for most people here even if you want to own your own AI. So for me, what matters most is who makes that cost of adaption the most efficient. And for us, that's the sole thing we're obsessed with is like how do we make it possible for any builder to adapt real time to whatever task they have. So I think it's one of the most profound problems that we can work on and I'm super happy to talk about it afterwards with whoever is interested.

Um and I think I'll leave it there. So um I think that I'll also just share we're offering order scientists for free for the next month. So, I think the proof is in the pudding. Just try it for yourselves and you're welcome. I would love to be back along the way. So, thank you so much. Uh, and I really it's a real privilege to be here.

Thank you. &gt;&gt; Oh, thank you so much Sarah. That was a great talk. Um, next we have Vincent from the Miniax platform engineering team. We spent the last day talking a lot about agents building agents, but what happens if you let agents autonomously schedule autonomously schedule the amount of compute and resources they need?

Going one level above. So, we'll be sharing a

Vincent Wu (MiniMax): agents that schedule their own compute and inference exchanges

Vincent Wu, Solutions Engineer on the API platform team at MiniMax (Chinese frontier-AI lab behind the MiniMax M-series LLMs and the MMX/MiniMax CLI; sister-stack also covers speech / image / video models)

Summary

  • Vincent Wu (MiniMax API platform) reframes inference as a scheduling and market problem, not a latency problem. Compute is the commodity of the next century and is being grossly under-utilized — the dead giveaway is inference providers (Anthropic-style) blocking third-party harnesses, which he reads less as competitive defense and more as a symptom that providers can't plan capacity without knowing the caller's workload shape ahead of time.
  • The kernel: a session's token profile — distribution of cached input tokens, uncached input tokens, and output tokens, plus model and metadata — determines which cluster/batch can serve it efficiently. If providers know the profile a priori, schedulers can balance prefill, KV-cache reuse, and decode across the fleet and dramatically raise MFU and throughput per second.
  • Asking humans to disclose token profile per session is absurd ('I don't even care about my token distribution') — but it is entirely reasonable to ask autonomous agents to disclose it. Agents already own more of the harness over time (context management, tool selection that used to be hard-coded by engineers); the missing slice they don't yet own is their own compute and intelligence selection — when to run, on which cluster, against which model.
  • Greedy best-effort inference made sense in the pair-programming era because a human was waiting for tokens. With long-running background agents handed a goal + deadline + dollar budget, there is real room to schedule: route planning work to a strong reasoning model at 2am when costs and queue depth are lowest, then switch to a low-latency VLM at the end of the run for live GUI QA — even with no human watching.
  • Frames the split as answer inference (interactive, latency-bound, what most APIs optimize for today — credits Ben Thompson's Stratechery 'The Inference Shift' post for the language) vs agentic inference (asynchronous, budget-bound, latency mostly irrelevant). Answer inference becomes a strict subset of agentic inference, invoked when an agent decides a sub-step actually needs real-time response.
  • The structural endpoint: an inference exchange. Background agents submit session manifests (model, token profile range, deadline, budget) into an exchange that matches each session to the optimal batch/node across all providers. Same market mechanism as any commodity exchange — converts under-used GPU capacity into user-and-provider surplus, maxes MFU per cluster, and smooths peak/off-peak duty cycles that providers can't fix today.
  • Outcome predictions: higher throughput per second (more revenue per GPU-hour) for providers; better request handling and lower rate-limit pain for consumers; off-peak / batch tiers expand and price-differentiate further (current low-latency / batch tiers are the embryonic version); pricing dynamically incentivizes agents to time-shift workloads.
  • Concrete first step: MMX CLI — MiniMax's CLI built explicitly not for humans but for agents to call MiniMax model endpoints across LLM / speech / image / video. Today it is an agent-callable surface across MiniMax's model family; tomorrow it's the on-ramp to agent-managed compute scheduling end-to-end (e.g. agent plans a batch of video-gen workloads to run overnight to hit a cost/deadline target).

Slides (9)

Slide notes

  • [t=5:17:28] Slide 1 — Title plate: 'Vincent Wu / Solutions Engineer, MiniMax / Agents That Manage Their Own Compute.' Right panel: 'Usage patterns affect inference efficiency.' Subhead: 'Different types of tokens strain compute in different ways. Schedulers balance prefill, decode, and memory over time.' Body: 'Token profiles (distribution of different types of tokens) estimate what compute capacity that session will consume, allowing for more optimal fleet scheduling.' Three-row table: Uncached input tokens → Schedule prefill; Cached input tokens → Manage KV cache; Output tokens → Reserve decode. Right: redacted screenshot of an Anthropic-style tweet — 'We've been working hard to meet the increase in demand for [redacted], and our subscriptions weren't built for the usage patterns of these third-party tools. Capacity is a resource we manage thoughtfully and we are prioritizing our customers using our products and API.' Footer: 'See more: Reiner Pope on Dwarkesh Podcast.'
  • [t=5:20:00] Slide 2 — 'If we can know a session's token profile a priori, we can serve requests a lot better.' Subline: 'Too much to ask of human users… But not too much to ask of agents.' MiniMax 'Intelligence with Every[one]' tag-bar at bottom.
  • [t=5:22:30] Slide 3 — 'Answer inference vs. Agentic inference.' Embedded Stratechery tweet (@stratechery, May 11): 'The Inference Shift — Agentic inference is going to be different than the inference we use today, and it will change compute infrastructure because speed won't matter when humans aren't involved.' Right column boxes: Answer inference — 'Optimized for interactive use.' Agentic inference — 'Optimized for work that can be performed asynchronously, without constant and immediate human attention.'
  • [t=5:24:30] Slide 4 — (Inference exchange / scheduling diagram; described verbally in transcript: background agents submit session info — model used, token profile of cached/uncached input + output, plus metadata — to an exchange that matches each session to the optimal batch on the optimal node. Outcome: maxed MFU, smoothed peak/off-peak duty cycle, higher throughput per second for providers, fewer rate limits and lower off-peak prices for consumers.)
  • [t=5:27:00] Slide 5 — 'MMX: a near-term CLI for agent-managed compute.' Screenshots of the MMX (MiniMax CLI) GitHub README on the left (banner art, feature list) and a terminal session on the right showing the CLI invoking the model endpoints. Framing in talk: 'this CLI is not for humans — it's a way for agents to autonomously call our model APIs (LLM, speech, image, video-gen). Future: end-to-end agent compute scheduling — e.g. agent batches video-gen runs at off-peak hours to hit a cost target.'
  • [t=5:29:30] Slide 6 — 'Thanks!' Two QR codes (Official Website, X/Twitter). MiniMax 'Intelligence with Every[one]' tagline. [Slides 7–9 are speaker-cam / stage-handoff frames — black background variants of the Thanks slide and a wide stage shot with the 'AIE' block letters as Vincent steps off and the Robot Company team (Sid & Daniel) walks on. No new content text — drop from substantive notes.]

Full transcript (1898 words)

&gt;&gt; All right. Hey guys, uh my name is Vincent Lou. I'm a product engineer on our API platform team. And today I'm going to talk about agents that manage their own compute. So the first thing is sorry next slide. Oh, it's good. We're good now. Yeah. So um compute is everybody knows that compute is uh undergoing a big it's like one of the biggest uh commodities of the next century and uh we're not using it very efficiently now.

So the best way to see this is that I'm sure you guys know um certain inference providers are um are blocking thirdparty harnesses from using their uh inference. And you know, part of it might just be about competition, but really the main thing is that um compute is very uh request dependent and that different types of requests, different types of workloads have different strains on uh your compute. So for example, in particular, different types of input tokens uh and and input and output tokens, your token profile as we like to call it, have a major effect on how well inference providers can utilize their compute. Uh so there was a recent podcast on Dwar Cash uh did with Riner Pope and he basically talks about the specifics of how uh inference workloads depend heavily on your token profile and so this is the reason why it's going to make sense for agents to uh manage their own compute.

Basically, if we can know if as an inference provider, if we can know uh a session's token profile beforehand at priori, then we can serve requests a lot better and we'll be able to essentially maximize our um fleet utilization and to serve more requests to more people uh with less failure. And now you know this kind of this kind of uh demand is a bit too much for humans to handle because if you imagine you know you're using codecs or cloud whatever and you before every session you need to tell the infant provider like exactly what kind of workload you're doing how long you're going to do it for your token distribution. I mean I don't even care about my token distribution. So this is too much to ask for humans but it might actually be quite reasonable to ask agents autonomous agents to do this.

And this is more of an observation, but agents are owning increasingly more of the harness. So from context management to tools used to be hard-coded stuff by um engineers, but now agents are basically uh managing this these kinds of resources on the fly. But one thing that agents are not actually managing is their compute and their intelligence. So basically we don't really give agents the ability to select uh first of all their models.

Although there actually we're we're seeing you know ways for agents to switch their brains when they want to. But more importantly just their compute like when they want to actually do the work and do perform the inference. And so um this didn't make sense before uh autonomous longrunning agents because when you're just pair programming with a human there's not much to schedule. Basically when a human is talking to the agent and their programming then you just want that inference right now.

You want the work to be done currently. So there's no there's not much scheduling to be done and is really just like greedy best effort. But as agents become more autonomous as we you know hand them background task and have them do things in the background then there's actually a lot of room to maneuver around scheduling your compute. So for example if I give my agent like a deadline I want something done by the end of the week and I just give them a goal and a budget right?

So with those constraints in mind, the agent has um there's a lot of things that the agent can do to basically spread out the different types of work that it might need to do at different time intervals when compute is available. So a quick example would be for um you know let's say your agent is just building an entire application. Well maybe for the first planning phase it doesn't need to hop into it immediately. It can wait for planning.

It can first of all select a really good planning model that might not be good implementation and then have that model do the planning maybe like at midnight when when the inference costs are lowest or when there's a high success rate and then later on you know maybe towards the end of the project it needs to do quality assurance and needs to like review its application there you might need to switch to like a V a really strong VLM guey model and have it do um low latency work to actually test the application in real time. So already you can see how like for different workloads you really have very different uh token profiles and request profiles for that particular kind of workload which might be suited to very different uh compute clusters. And so this is a recent um blog post on strat uh written by Ben Thompson and he basically he's um making this point uh by separating answer inference from agentic inference. Now answer inference is um the stuff that currently most people care about.

It's when you go into your coding agent and you just you're pair programming with agent. You want to see the outputs come out faster. You want it to think faster. You want like real-time latency. That's answer inference. Uh but agentic inference is different in that actually for agentic inference latency doesn't really matter as much.

Uh because like I said earlier, you're really just handing off a goal in a compute budget or in a budget like a dollar budget. And then the agent can sort of optimize around your budget and your goal depending on the resources um at hand. And also I should point out that um there's a sense in which answer in imperence is actually a part of agentic inference because you could easily imagine how sometimes the model the the agent would still want to have low latency work done during its background period because for example the example I mentioned earlier about um like a gooey review of the application at the end because you want real-time latency there even though nobody's watching. So in the limit um we expect something like an inference exchange start happening where all these background agents you know they're running out in the wild and um before their workloads they basically submit their session information to uh the inference exchange.

So the most importantly would be model used and then token profile. So your um your range of the number of cache input tokens uncashed input tokens and output tokens and uh some other metadata along that line. And so then the exchange would match your session, the agent session to the most optimized batch on most optimized node for that kind of workload. Uh in order to you know basically find uh the the the comput the hardware that is most best suited and best configured to serve that workload at that time.

Now the good thing about this is that you know in the just like any kind of market mechanism inference exchanges are going to be able to uh turn underused compute capacity into user and provider surplus. Uh because assuming optimal matching then we're using we're using basically the uh the best we're making the best use of the world's compute of any inference provider and every GPU you know their MFU is going to be maxed out because um they're going the workload particularly running on that cluster is going to be optimized for the configuration of that cluster um and then also uh fleet utilization in terms of like different time periods so right now we providers are seeing this thing where like for example during the afternoon they're overloaded because everybody's using their agents at that time but then like during midnight you know it's their their their GPUs are underused and that's not good for providers because they want their GPUs to be running all the time. Um and so with this kind of inference exchange and with agents autonomously managing their own compute we can have much better matching and basically smooth out the peak and off peak hours. So overall what this does for the inference providers is higher throughput per second.

So the the the throughput of your entire system is going to go is going to become more optimal and that's good for the inference providers because that's how they make money. The more tokens they can serve the more money they can uh the more revenue they bring in. But this is also good for consumers because uh again as I said in the beginning right now consumers we're facing a lot of issues where uh our requests are simply just getting like rate limited or um they're just not they're just not getting served well by the provider and that's because they're not using their GPUs to the maximum to the most optimal way. And so for consumers what we're going to see is just uh better request handling overall.

And also there's going to be a cost thing as well because you can imagine how providers might um decrease the cost for like off peak hours uh so that agents can like uh are incentivized to go use that kind of compute for a lower cost and we already see this for service for example I think many providers have different levels of service low latency high latency batch which have different pricings and so finally this is kind of a plug for um our MMX CLI. So this CLI is not for humans to use. This is really a way for agents to autonomously call our model APIs uh because we have a range of models, you know, from speech to image to videogen to of course our LMS. And so for now, this is really just a way for uh agents to you know uh effectively call our model endpoints.

But in the future, we intend to build this out to be um to basically cater to what I said earlier about having agents manage their own compute more end to end and more sophistic in a more sophisticated manner. So maybe they decide to run a bunch of video workloads at different periods of the day and uh to save to save money and then to maximize compute. Oh, and that's it. Thanks.

Oh, &gt;&gt; cool. Thank you so much. &gt;&gt; Really appreciate that was a great talk. Uh, next we have uh Sid and Daniel who'll be introducing their company, the robot company. We've been talking a lot about agents, deploying them, coding agents, but what does it take to take an agent and deploy in the real world?

And so they'll be looking at how to deploy teleoperated robots in a physical environment. Hi. Hi. Hi. Hi. Is this Oh, it's working. Good afternoon. My name is

Daniel & Siddharth Krishnan (The Robot Company): teleoperated robots today, the embodiment data engine for autonomy tomorrow

Daniel & Siddharth Krishnan, The Robot Company (founders — likely embodied AI / humanoid play)

Summary

  • The Robot Company runs a contrarian thesis against the 'teleop is dead' consensus: deploy teleoperated bimanual arms into messy commercial sites (insect farms in Cambridge UK, laundry facilities, food prep, hospitality) so customers pay you to collect the highest-quality embodiment data in the industry — work done today funds the data engine for autonomy tomorrow.
  • Live demo on stage with two teleoperated arms mounted on a table; reference deployment is a black-cricket harvesting line at a Cambridge insect farm feeding the gecko/reptile pet trade — explicitly chosen because no human wants the job (the 'sai kang' use case).
  • Data-quality vs scalability framework: Simulation (most scalable, weakest sim-to-real) → Egocentric video (scalable, weak actuator mapping) → UMI/wearables (joint positions, decent both axes) → Teleoperation (highest quality embodiment data, worst scalability because 1:1). Maps directly onto the LLM Pre-training → SFT → RLHF stack with named robotics models: Pi 0.5, GR00T, Galaxea G0.
  • The 'autonomy gap' framework: SFT on teleop data gets you ~80% autonomy — the Twitter-demo cliff. 80% means '1 in 5 shirts hits the floor'; production breaks. References Figure's recent 8-hour package-sort livestream as a public failure of the same wall.
  • Three-step prescription: (1) deploy teleop to do real paid work and harvest embodiment data on the actual morphology + environment + task; (2) supervised fine-tune to ~80% autonomy on PI 0.5 / GR00T / Galaxea G0; (3) close the gap with 'telesupervision' — humans intervene in real time on robot mistakes, scaling 1-to-many (Waymo model) instead of 1-to-1.
  • Operator economics are brutal and underdiscussed: Siddharth trained ~100 operators in the UK; only 30-40% passed onboarding. This is the bottleneck the field doesn't talk about.
  • Cross-border low-latency teleop stack: live demo Singapore→London under 100ms; explicitly extrapolated to Singapore↔US, India↔Singapore, China↔Singapore — i.e., labor-arbitrage geography for teleoperators becomes a feature, not a bug.
  • Closing positioning line: 'An enterprise cannot think like a research lab. You have to think radically differently.' Telesupervision-as-deployment-layer is the wedge; data is the byproduct, exactly like LLMs and self-driving — 'the most valuable datasets are byproducts of real useful work done.'

Slides (17)

Slide notes

  • [t=5:29:44 / 19784s] Daniel opens on stage behind a live teleoperated bimanual rig on the table; background screen shows insect-farm deployment footage. Tag line: 'We deploy teleoperated robots today for autonomy tomorrow.' [Slides 3-12, t≈5:29:44–5:35:00] Title card: 'Siddharth Krishnan, Founder, CEO, The Robot Company — Closing the autonomy gap.' Picture-in-picture B-roll loops of the Cambridge UK black-cricket insect-farm deployment: bimanual arm scooping cricket substrate, gloved human operator hands intervening on a laptop teleop UI, live operator video feed on the laptop screen. Around slide 11–12 the B-roll cuts to a commercial laundry floor — piles of mixed garments on a long table, a single robot arm folding amid human workers — illustrating 'menial unpleasant sai kong work' as the wedge category. [Slide 13, t≈5:34] Meme slide: TELEOP gravestone 'TELEOP 2022–2025 RIP IN LOVING MEMORY' with the Drake-funeral guy whispering 'SENSORIZED HUMAN DATA' — Siddharth's setup for the contrarian thesis (the field has declared teleop dead; he disagrees). [Slide 14, t≈5:34:30] 'THE TELEOPERATION CEILING — Teleoperation is linear. 1 human : 1 robot.' Two failure-mode boxes: 'Operator Training' and 'Technical Limitations (Hardware/Latency).' Anchored claim: he trained ~100 operators, only 30–40% passed onboarding. [Slide 15, t≈5:35] 'Deploying hardware is hard.' Photo collage of failure modes labeled 'New Environments,' 'Bugs/Malfunction,' 'Things break.' The 'we get actual bugs because of the insect farm' line lands here. [Slide 16, t≈5:36] LLM stack mapped to robotics: horizontal arrows Pre-training → SFT → RLHF, parallel bar 'Data,' parallel bar 'Models' with three pills: 'Pi 0.5,' 'GR00T,' 'Galaxea G0.' The framework slide of the talk. [Slide 17, t≈5:37] Two-axis chart, y-axis 'Scalability,' x-axis 'Data quality & Hardware alignment.' Four photo-bubbles from low-quality/high-scale to high-quality/low-scale: Simulation → Egocentric → UMI / Wearables → Teleoperation. The core conceptual frame for the 'why teleop, despite its costs' argument. [t≈5:38–5:40] Verbal close not on slide: three-step prescription (deploy teleop → SFT to ~80% → telesupervision to close the gap) + Singapore↔London <100ms cross-border teleop demo + closing line 'an enterprise cannot think like a research lab.' Handoff to next speaker (Justin Bar) on going 'directly to the brain instead.'

Full transcript (1789 words)

Daniel. Uh, that's Sad. We are from the robot company. We deploy teleyoperated robots today for autonomy tomorrow. There you go. What you see in front of you over here are teleyoperated robots deployed in an insect farm in Cambridge in the UK. So you see the little box of like little squiggly things over there.

Those are black crickets that are used to feed geckos and reptiles. Uh you can imagine that not many humans like to work in this environment which is why it's a pretty good use case for robots. I've spent the past year deploying robots in the UK. So apart from insect farms also laundry facilities, food preparation uh and hospitality settings.

So we focus on deploying teleyoperated robots. And right now you might ask Daniel why deploy teleyoperated robots. If you know, you know a recent a prominent researcher, sorry, my clicker, a prominent researcher recently mentioned that teley operation as a means for data collection is dead. And there are a lot of merits to this argument.

Firstly, and I have firsthand experience of this, tell operation scales linearly. Tele operation scales one to one, right? One human controlling one robot much like this. The other thing is that oper operator training is actually really difficult. I've trained about 100 operators um onboarded them only about 30 to 40% have actually passed onboarding and it's really difficult to scale that.

Another bit is with teley operation you get all the technical limitations of hardware latency and all those problems. And then the second piece of what we're doing, deployment is incredibly hard. You get new environments, which means new lighting, new tables, new dimensions, and of course, new customer demands. You get bugs.

In our case, we get actual bugs because of the insect farm. But we also get bunch of software bugs uh and malfunctions. And with any hardware, things break. So why deploy teleoperated robots? Before I get into our thesis, let me quickly go through how models have scaled and what that means for us. So very quickly, models have scaled firstly with pre-training.

So large amount of data, generalized intelligence, broad but unrefined. Then supervised fine-tuning to get the data trained on the model. So the model has task specific specializations. Then a huge unlock ROHF reinforcement learning with human feedback. Humans provide the golden truth answer and therefore the model gives really useful and good output and all of this is underpinned of course by high quality data.

In the robot world high quality data or data more generally generally falls under four buckets. If I point to you to the yaxis and x axis the y- axis is scalability and the scalability is generally inversely correlated with data quality and hardware alignment. So on the left simulation data everything runs in simulation software no physical uh world no physical robot uh there's a bit of a sim to real gap then you get egocentric data essentially a camera place on a eye level that is pretty scalable as well because not super complicated to do that but generally the data might not map directly to robot actuators and servos so data quality is not super high you get wearables which is popularized by Umei the universal uh manipulation interf That is pretty useful because you get joint positions or any factor positions and then you can do some physics and math to ensure that that maps onto a robot. So decent data quality and also decently scalable.

And then on the other end of the spectrum is teley operations. Teley operation very high quality data because the actual robot is in the field collecting data uh but not scalable because onetoone and also bring a robot everywhere is kind of tricky. Now, understanding models and I say understanding data. How do we get to useful deployment?

How do we get to useful deployment and useful work in the LLM space? What that looked like, and I'm being extremely reductive here, uh looked like an API call, right? Obviously, there's a lot more underneath that, but think about robots. Deployment is a lot harder and a lot tougher. How people have approached the problem robotics kind of looks like this.

The LM approach take data take compute and throw it in throw at a problem pre-train and SFT and that has had really really good results like recent models have shown really wonderful promising results in the lab often rely relying on simulation data ecoentric data often with some world models involved uh and that that has had a lot of you know high quality evals in the lab but how do we achieve and fix autonomy gap not just in the lab but in the real world. Our thesis is that we want to deploy robots in commercial settings and that does two things. Firstly, when you deploy a teleyoperated robot, you actually get real useful work done for customers, right? So, in this case, folding a t-shirt.

But this process also does something extremely useful that it collects very valuable data based on the work done by the robot. As we've learned from LLMs and self-driving, the most valuable data sets are byproduct of real useful work done. So that brings us to kind of step one. Actually, Chenise right here was supposed to give me a bottle of water, but deployment's hard and that didn't really work today.

But what I wanted to say was we were basically trying to uh we start every deployment by putting a teley operated robot into real scenarios. So you can see the guys out here folding clothes and you can also see Daniel doing a live demonstration of what that looks like. And above that what you get is we layer it on top with your you know pre-trained models that you might be already familiar with. Think PI 0.5 Groot some of the models that Daniel already shared about.

And that data that you get is essentially the highest quality embodiment data that you can get, right? And because the morphology matches um the environment matches and the task also matches, what you end up getting is a very good base foundation data set for you to actually deploy commercially viable uh for you to deploy commercially viable robots. And you have to remember this, all of this is just the starting point, right? The real work begins once you start getting into fine-tuning.

I think step two is the part where everyone in this room already knows how to do. Um you can take teleop data, you supervise fine-tune it on some of the models that you already know about, right? And you can kind of achieve about 80% autonomy and we all know what 80% autonomy looks like. We've seen these on Twitter on many social platforms.

Um what you end up getting is a really beautiful video with some hype and you know that works well when you want to garner attention but once you start getting into the real world and I'm sure there's a lot of enterprise uh folks here um 80% just doesn't cut it for production when you start getting 80% when you hear 80% in EVEL and we start getting to production you know what that really means for the customer that means that one in every five clothes falls on the floor of the customer site when they're trying to fold it, right? And that just doesn't cut it. So, what you have now is really a gap that doesn't ship, right? And this gap is called the autonomy gap.

You can kind of see figured they did a recent demonstration, a live stream actually of their robot kind of sorting packages. It was very impressive was doing it for eight hours, but they ran into issues too. And we believe a very specific mechanism, human intervention, real time could solve this problem at scale. So that brings us to step three, telly operation plus human intervention.

There is a terminology for this and it's called teley supervision. And teley supervision basically involves the idea of someone intervening when the robot makes a mistake. You make fine corrections and then you just let the robot do its thing and you keep iterating every time it makes a mistake. And how do you address the telly operation ceiling that we now have when you want to do this telly supervision?

Well, we can start by scaling from one to one to one to one is to many. And this isn't new. The self-driving world has been doing this for a while. Whimo has um you know examples of of of teley supervision and we believe the same could extend to robotics. And the other side is remote teley operation. We have a working stack that unlocks crossber low latency telly operation.

This is an example of us doing a demonstration from Singapore to London. You can now extrapolate. You could do Singapore to the US, India to Singapore, China to Singapore. All under 100 milliseconds on our stack. Now for enterprises, this is key because deployment is hard, but it's very necessary. The long tale of robotics lives in the real world.

And that 80% is the cliff's edge. So what we're trying to say is that telly operation used as a deployment layer combined with the menial unpleasant Saikong like work that you need to do right is what makes successful deployments. And the way you need to do this is that you have to think differently. An enterprise cannot think like a research lab.

In fact, you have to think radically differently. And you need to start with telly operation as your your your fundamental starting point. And then you start collecting rich data and then you start deploying commercially viable models and robots. And that brings us to the end. So that's what Daniel and I do at the robot company.

We deploy robots that do real work today as we build the data engine for autonomous robotics tomorrow. So if you want to learn more about us, you can find us at the robot company.ai. Thank you. That was an amazing demonstration and I think uh you know it's just a testament to how complex it is to deploy robots in the wild.

So we talked about how we can tell operate robots you know have people actually help but what happens if we bypass that and go directly to the brain instead and so in this specific portion we'll be talking about Justin bar will be sharing about how you can do that

Justin Baird (Tesseract) + Kai Ming: shared autonomy, BCI painting, and embodied creative agency

Justin Baird, Founder, Tesseract (creative-tech / robotics-art collective fusing multimodal AI, robot arms and consumer BCI to extend human creativity) with Kai Ming (AI policy researcher, Rare Disorders Society of Singapore; lives with Alstrom Syndrome, lost hand dexterity for painting)

Summary

  • Live, on-stage demo (the only one of the day): a Tesseract robot arm paints a canvas of 'Hope the sloth' (RDSS mascot, born with two fingers) while Kai Ming - in a wheelchair, wearing a Muse consumer EEG headband - completes the final strokes using brain-control input. Talk title on lower-third: 'Mindflow: Enabling Creative Agency via Shared Autonomy and Synthaesthetic BCI.'
  • Tesseract.art / Tesseract is a ~3-year R&D project by Justin Baird and Dr Richard Savory: a robot arm ('Tessa') plus multimodal-AI pipeline that translates one creative form into another (originally live music -> painting), now pivoted to BCI-driven control. Framed as 'human imagination extended through intelligent systems' / Synthaesthetic Art.
  • Shared-autonomy thesis: the human supplies intent and creative agency via a low-bandwidth channel (EEG focus/attention signals); the robot+model supplies motor execution and aesthetic stroke generation. Neither alone makes the painting - the autonomy is genuinely shared between operator and machine.
  • Hardware-collapse claim that matters: the BCI used is a Muse consumer headband (~$200-400 retail), wireless, dry-electrode, ~30s to fit. Replaces what Baird says used to be '2 hours of putting on a headset and all this expensive equipment.' Brain-interface partner on stage is Jackie from 'Mind Interface Company.' Ivy from Tesseract also assisted.
  • Kai Ming's story is the emotional payload and also the use-case proof: AI policy researcher, has Alstrom Syndrome (rare disease causing progressive loss of dexterity), came from a family of artists (granddad and sister), 'grieved my hands, grieved my passion' when she could no longer paint or write. The BCI+robot rig restores creative output through a non-motor channel.
  • Painting symbolism (Kai Ming narrating): Hope the two-fingered sloth from Singapore Zoo, with a heart in two colors representing parents who carry rare-disease patients; wings represent the support community. Painted as one continuous gesture - 'one stroke is the dad and one stroke is the mom.'
  • Pivot to the bigger frame - AI as employment expansion, not replacement: Baird closes with the Dawn Avatar Robot Cafe (OryLab Inc, Japan), a permanent cafe in Tokyo where bedbound or housebound people remotely operate OriHime / OriHime-D bunshin ('alter ego') robots to wait tables. Cited as a real-world template for human-in-the-loop AI as a labor-market opening for people with disabilities, not a closing.
  • Positioning line for the room: 'we needed a break from all the fear and money chasing with something a little bit positive' - explicitly stages the segment as the conference's counter-narrative to the doom/AGI-replacement thread that ran across Day 2.
  • Stage logistics confirmed the production cost: full robot-arm rig wheeled out mid-talk, easel + canvas + paint cart, brain-interface partner on stage, RDSS branding on Kai Ming's shirt. Built in roughly the last month before the conference.

Slides (32)

Slide notes

  • Segment starts at 5:41:06 (t=20466s) and runs to 5:55:52 (t=21352s, 14.8 min). The deck for this talk is intentionally minimal - mostly stage shots of the live robot-painting demo - so the load-bearing content is the demo itself, not text slides. Documenting the most signal-bearing 14 frames below. [t~5:41:30 / slide 04] Speaker cam: Justin Baird at podium wearing Tesseract-logo t-shirt, holding mic - sets up that he has '10 minutes and lots of things to show you.' [t~5:42:30 / slide 08] TITLE FRAME: 'Synthaesthetic Art - human imagination extended through intelligent systems' - white text on black, robot arm + easel + Kai Ming in wheelchair already on stage in foreground. This is the talk's thesis slide. [t~5:43:00 / slide 17] BRAND FRAME: Tesseract.art logo (purple hexagonal cube) full-screen with the robot rig, Kai Ming in wheelchair, and the giant 'AIE' letters in shot. Establishes the project identity. [t~5:43:45 / slide 01] Stage shot: Kai Ming entering on stage in a manual wheelchair, wearing the Muse EEG headband on her forehead. The headband is the single most-important visual fact of the talk - this is what makes the BCI channel commodity-grade. [t~5:45:00 / slide 16] Stage shot: Robot arm 'Tessa' on a wheeled cart with paint, easel, canvas; Kai Ming and Baird in conversation; the painting of 'Hope the sloth' visibly in progress on the canvas (greys, blacks, pink wing strokes already laid down). [t~5:46:30 / slide 03] Kai Ming on mic, smiling, wearing the Muse headband. This is where she narrates: 'I'm Kai Ming. I have a condition called Alstrom Syndrome. I'm part of the Rare Disorders Society of Singapore which Justin has been working with. I'm an AI policy researcher and that's how we met.' [t~5:47:30 / slide 20] Wide stage shot: Baird gesturing at the robot rig, Kai Ming in wheelchair beside him - the shared-autonomy setup in one frame (human intent on right, robotic execution on left). [t~5:48:30 / slide 19] Kai Ming on mic - this is the emotional peak of the talk where she says 'my condition has kind of made me lose a lot of my dexterity in my hands... I wasn't able to paint anymore... I grieved my hands, I grieved my passion. And to suddenly have this outlet, it's just amazing.' [t~5:50:00 / slide 13] Close-up of the canvas mid-paint: 'Hope the sloth' rendered in grey/black/brown body with five pink wing/finger strokes below. The robot arm gripper holding the brush is visible mid-stroke. This is the actual artifact being made on stage. [t~5:50:45 / slide 21] Close-up of robot arm executing a continuous magenta stroke - the heart line that Kai Ming describes as 'one stroke is the dad and one stroke is the mom.' [t~5:51:30 / slide 14] Tighter close-up of completed painting: heart now closed around the sloth, parental strokes complete - the finished work. This is the deliverable of the live demo. [t~5:53:00 / slide 25] CONTENT SLIDE (one of only two real info-slides in the deck): 'Our "BUNSHIN (Avatar) Robot Cafe" is a permanent experimental cafe operated by OryLab Inc, where people who have difficulty going out for various reasons remotely operate our bunshin robots OriHime and OriHime-D from their homes and hospital rooms, to provide cafe services.' Shown alongside the Dawn Avatar Robot Cafe logo and an interior photo of customers being served. THE single most-citable artifact in the segment. [t~5:53:30 / slide 26] Same DAWN deck but the right-hand photo now shows a bedbound operator (bed, blanket, headset/visor) remotely operating an OriHime robot from a hospital room. Visually carries the 'AI as labor-inclusion, not labor-replacement' argument in one frame. [t~5:55:30 / slide 32] Closing wide shot: empty stage, AI Engineer Singapore signage and giant 'AIE' letters, demo equipment being wheeled off. The transition slide into the next speaker (Arvin / Bifrost - synthetic worlds for robotics training).

Full transcript (2002 words)

Hello everyone just getting this started. Um, thanks for joining today. We have uh another interesting robotics uh experiment to show you. So, we're getting there in a moment. But while they're connecting, I'll just get started. We've got lots of things to show you in the next 10 minutes. So, uh uh get ready.

Um but thanks again um for making this happen here in Singapore. I mean, AI.jer coming to Singapore is amazing and having um Agram and Sherry uh bring this all together with the 65 Labs team is great. You want to just hit play? Uh that one that you just minimized. Are you guys getting this or no? &gt;&gt; Hold on.

Extended. &gt;&gt; Yeah, extended. It is extended. It's extended. &gt;&gt; Now you get it right. &gt;&gt; Okay. 3, two, one. All right. Thank you, everyone. So, as part of Tessact, we've built a system. We call this, it's called Tessact.art. Um, and what we've done with this is we've built a system that allows people to express themselves through AI.

And this started out by having live music performance and turning that live music performance into a painting. Um, but from that we've kind of moved this uh much further along. And so I'd like to call out um Kaiing. Kaing, would you like to come out with us and we're going to start rolling out some equipment?

Thanks. Um, everyone, I just want to introduce Kai Ming. Um, we've done some quite interesting and special things together for this. Thank you. And so for so for the last two years please everyone if you can roll out you guys going to roll out everyone. Sorry we got lots of things rolling out guys. Sorry can you guys help roll out?

Thanks. Okay sorry this is quite difficult to do in like a 10-minute presentation when we have like a full robot system and painting and all this other kind of stuff. So please bear with us one second while this is happening. But um as you'll see, we're bringing out um a system that we call tessoract.org.

And what Tessa is, the robot arm, Tessa, the robot arm, we've been developing this over the past two to three years, uh with a couple collaborators, um my collaborator, Dr. Richard Savory and I started this about three years ago and we wanted to build a system that would allow us to be able to use robotics along with multimodal AI to be able to take one let's say creative form and turn it into another and that's where we started with this in terms of um bringing music together and what we're doing with this is really taking human imagination and extending it through intelligent systems and that's the intention of what we've done here today. Now, what we also have on uh stage, we've got Jackie also here from &gt;&gt; Mind Interface Company, &gt;&gt; and we have Ivy, who is also attending here with us from Tessact. And Ivy, I might just ask you to come uh up and help.

And so with Kaiing, what we've done, what you're seeing here live on stage is the very first painting that um Kaiming has painted using brain control in her face. So, it might be hard for you guys to see in the back, but she's actually wearing um a head a headband that comes across the front. Um it's a Muse if if anybody in the audience knows the Muse headband. But what's so fantastic and amazing about this is that this technology is now to the point where it doesn't take two hours of putting on a headset and like all this expensive equipment.

We can literally put this on and Kaiming can just think about what she wants to do in terms of the control interface and actually make things happen with the painting. Um, so Kaiming, I'd love to hand the well ask you a couple questions. Um, maybe you can just tell us a little bit about how we got here today. &gt;&gt; Okay. Um, hi.

So, I'm Kaiming. Um, I have a condition called Alist Syndrome. So, I'm part of the Red Disorders uh, Society of Singapore, which Justin has been working with. Um, so I'm an AI policy researcher and yeah, that's how we met. &gt;&gt; Yeah. And so, um, you've done some artwork in the past, um, and, um, what we're now able to do is bring, let's say, some of your creativity back, um, through this process of using AI and our multi multimodal systems.

So, what we planned to do was we've been painting this painting. Maybe you can tell us a little bit about this painting. &gt;&gt; Do you want to move forward? &gt;&gt; Can you hold it? &gt;&gt; Yep. Yeah, that's fine. Thanks. So yeah, I've been painting since I was a kid with my granddad and my sister who are both artists as well.

Um, and it's something that's really connected me with the world. Um, and my condition has kind of made me lose a lot of my dexterity in my hands. And so I wasn't able to even write anymore and I still kind of can't. And so I wasn't able to paint anymore. And I went into anthropology hoping to kind of live vicariously through it.

And that's how I ended up in AI policy. But you know, it's I grieved my hands. I grieved my passion. And to suddenly have this outlet, it's just amazing that it's it's kind of been brought back to life. &gt;&gt; Awesome. Thank you. Thanks. Yeah. Okay. And so now for the moment we've been all we waiting for is we're actually going to see if we can get because literally we brought this we this whole thing has come together over the past month.

So we're going to have um uh Kiming try to finish one of the final lines. So with this painting this is of Hope the sloth from the RDSS. Did you want to speak about that? &gt;&gt; Oh yeah. &gt;&gt; So Hope is a two a sloth that was born with only two fingers. He lives at the Singapore Zoo and he's kind of like us.

We take life kind of slow and steady. And um this is a painting of uh hope this love code around a little finger. And um the two colors that you'll see on the heart and the wings that's the parents that we uh you know who support us uh red is order kids and um yeah. &gt;&gt; Yeah. So there's hopeless sloth.

So let's try it. Ready? Great. So, uh maybe you can tell us the concept was here for there to be a heart that surrounds Yeah. &gt;&gt; Yeah. And so the heart it's like you know one stroke is the dad and one stroke is the mom because you know we often forget like how much the parents in our community uh support the our patients with rare diseases and they do so much.

It's just incredible and you know I really want to thank thank Justin and his team for giving this back to us this kind of freedom and liberty to do what we want to do with our lives. Yes, thank you. Thanks so much. And I'm I'm actually quite surprised to know that we have three minutes left to actually finish our conversation which is great because anyhow so I think one of the things that's really become an inspiration from this and what I think was the important message that I wanted to leave you know everyone who's seeing this really for the first time happening is um we started this process thinking about using um AI to give people creative superpowers, right?

We want to not have AI take creativity away. We want AI to give people superpowers, AI superpowers, creative and fun things. And we did that starting from music. And what we've done now is pivoted towards the brain control interface and being able to make this wireless system happen. But you can ask like creativity, it's great.

It's a part of self-expression. It makes the things that um you know it's a very human thing to be able to express yourself and to have this form of communication. But what's even more inspiring and I want to just show one thing um as well. What's even more inspiring is what could we do with this type of technology?

Um we talk about AI taking people's jobs away. Um, what I see with this and our collaboration together is we're giving new opportunities for employment to people that perhaps have not been able to be employed because of maybe having the disability or not being able to um be as mobile as others. So now imagine that as this technology develops, this is literally just the beginning of making these things happen. We could see employment um becoming something because we need AI as uh you know AI in most of these systems today need a human in the loop right so if you can think about AI being something that is providing the opportunity to do something you know let's say it's a dark factory it's all automated but there needs to be people to supervise it there needs to be people to do some of the work and just today literally through this process um I found out about um a very special um place that um sorry that that built a very special place in Japan uh where this is already happening.

I literally just found about today. So this is an avatar um an avatar robot cafe. But what's interesting in this case is that the robots are um are fully managing things, but there are people making these robots work. And the robots are serving customers. The robots are serving customers, but they're actually um being uh they're employing people that are not able to leave their beds perhaps or leave their home um to actually have gainful employment.

And so I think this is a great perfect use case example of um the opportunity that uh could present itself um with this type of technology as this progresses in the future. So I think that's a really amazing and inspiring um opportunity to think about how AI is going to completely open up a new opportunity, a new work workforce um for people that might not have been able to be employed in the past. Anyhow, so thank you very much for having us. Really uh been amazing to be a part of AI engineer.

Um thank you um Jackie um for making the brain interface happen. Um and any closing words? I think we all needed a break from all the fear and money chasing with something a little bit positive. &gt;&gt; Thanks very much. Thank you everyone. Thank you AI engineer. Appreciate it. Thanks. &gt;&gt; Do we want to help you go off the stage this way?

Make sure you talk about All right, that was an amazing presentation. I think especially in the doom and gloom of AI like that offers like so much hope. So, we've looked at how you can use BCI and for the next presentation, we're going to have Arvin from Bifrost where they build synthetic walls to train models. They've been working with some of the largest robotics companies in the world, helping them do things all the way to landing robots on Mars.

They're backed by Seoia and also the CIA secret venture fund. One really cool fact is that the previous robot company and Bifrost are both Singaporeans companies that were started, incubated, and really born in Singapore. And with that, really excited to have Arvin take the stage. Awesome. Sadly, I do not have any cool robot demos for you guys, but that was pretty

Evals for Robotics: Sim-Generated Worlds, Domain Randomization, and Fail-Fast Edge Case Discovery

Aravind (SK) Kandiah, CTO & Co-Founder, Bifrost (sim-generated worlds platform for robotics evals — counts US Air Force, NASA, Anduril, Saronic, Mitsubishi Heavy, Honda, ST Engineering, NTT Data as customers)

Summary

  • Frames the 'robotics deployment gap' with a single number: 90% success in the lab collapses to ~40% in real-world deployment because training/test distributions don't cover the dynamic chaos of the field (mirrors, glare, people, plastic bags, edge intersections).
  • Pushes back on the naive bitter-lesson framing: most additional robotic data is 'empty calories' — the 1000th highway scenario adds zero signal; what's needed is edge-case coverage across lighting, layout, weather, traffic, and rare events.
  • Reliability is the only metric companies actually care about — success rate × number of scenarios — and it scales as N × N when you want to ship across N domains. The race in robotics is for the scaling law on reliability itself.
  • Walks the eval stack in three rungs: (1) manual staging with humans + robots + real time — bottlenecked by all three; (2) Gemini Robotics ER 1.5 + AutoEval (UC Berkeley/NVIDIA) replace human evaluators with VLMs and auto-reset policies — removes the human but still bottlenecked by robots and time; (3) full sim using real data to generate the simulator — breaks the linear scaling.
  • Core mechanism: take real captured data, lift it into a domain-specific generative simulator (e.g. boat in a glare-lit marina), then parameter-sweep across operational conditions (sea state, fog, time of day) to spawn N scenarios from 1, then product-sweep across criteria to get N × N coverage — 'domain randomization' beyond what real data could capture.
  • Edge-case discovery method: 'Fail fast in sim, verify in real.' Run dense coverage in simulation (green/red grid), then direct expensive real-world testing only at the red cells where the model failed. Converts capex on robots/humans/time into compute spend.
  • Reframes the meme: 'data flywheel' is misleading because it ignores refinement — what matters is a 'data refinery' that loops real → sim and surfaces what should be collected next, weighted by failure signal.
  • Bifrost productizes this as a browser-based world-simulator where customers 'break their physical AI' before shipping — positioned across maritime, aerial, robotics, land, and off-world (space) verticals.

Slides (18)

Slide notes

  • [t=5:55:52 / slide 1] Title slide: 'Aravind (SK) Kandiah, CTO & Co-Founder, Bifrost — Evals for Robotics: Measuring physical intelligence and deploying robots.' [t=~5:56:30 / slide 2] 'Robotics Deployment Gap' — large green '90% In the lab' vs red '40% In deployment'. The headline number for the talk. [t=~5:57:30 / slide 3] 'What Causes The Deployment Gap?' — distribution plot, y-axis 'No. of scenarios', x-axis 'Types of scenarios', single narrow Train distribution shown. Sponsor row visible: Arize, Google DeepMind, Cursor. [t=~5:59:00 / slide 4] 'Manually stage scenarios to test how their systems handles different scenarios' — boat-on-water clip + robot-arm-on-yellow-table clip. Caption: 'Number of test scenarios is bottlenecked by: Humans, Robots, Time.' [t=~6:00:00 / slide 5] Reliability vs Compute graph — red 'Human-Robot-Time' line scaling linearly with compute. Sets up the scaling-law framing. [t=~6:01:00 / slide 6] 'Replace human evaluators with VLMs and automated scene resets' — shows Gemini Robotics ER 1.5 querying 'Is the door open?' on a Spot-style quadruped, plus '3. Execute Reset Policy (learned)' on a robot arm. Caption: 'Number of test scenarios is bottlenecked by: Robots, Time.' Sources: Gemini Robotics ER 1.5, AutoEval (UC Berkeley, NVIDIA). [t=~6:02:30 / slide 7] 'Close the sim-to-real gap using real data to generate the sim' — grid of real boat photos → simulated boat renders. Real Data | Simulated Data split. [t=~6:04:00 / slide 8] 'Real → Sim' — distribution plot now shows Train (narrow cyan) + Sim (green, slightly wider) + Test (yellow) + Deploy (red, very wide). Sim distribution overlaps Train but extends toward Test. [t=~6:05:00 / slide 9] 'Parameter sweep across your key operational conditions' — '1 Scenario' column expands to 'N Scenarios' column; two simulated marina/buoy images at different conditions (foggy sunset, snowy daylight). [t=~6:06:00 / slide 10] 'Expanding to all operational scenarios' — N × N grid of scenarios with axes labeled across sea states (High Sea State: Sunny/Clear/Rainy/Foggy; Low Sea State: Sunny/Clear/Rainy/Foggy) and other criteria columns. The N × N reliability scaling visualized. [t=~6:07:30 / slide 11] 'Fail Fast in Sim. Verify in Real.' — left grid: dense green/red 'Test everything in sim'; right grid: sparse red dots 'Focused testing in real'. The core edge-case discovery method. [t=~6:08:30 / slide 12] 'Real + Sim: Catch Failures Early With High Coverage Test' — full-width green Sim distribution underlaying Train (cyan) and Test (yellow) curves. [t=~6:09:30 / slide 13] 'Data Refinery' — 'Data Flywheel' struck through; Real ↔ Sim loop diagram. The quotable reframe. [t=~6:10:00 / slide 14] Bifrost logo slide: 'Battle-Tested with the World's Most Demanding Customers' — logos: U.S. Air Force, NASA, Anduril, Saronic, Mitsubishi Heavy Industries, Honda, Privateer, ST Engineering, NTT Data, Havoc, Striveworks. Vertical thumbnails: Maritime, Aerial, Robotics, Land, Off World. [t=~6:10:30 / slide 15] 'Simulate the world in your browser & break physical AI' — closing CTA for Bifrost product. [t=~6:11:00 / slides 16-18] Stage handoff to MC + intro for next speaker (Julia Kim, Open Graph Labs) on multimodal data-collection sync for robotics training.

Full transcript (2321 words)

And co-founder over at Bifrost. And today I'll be sharing a little bit about the state of robotics, right? I'm sure you guys would have seen a whole bunch of cool videos online of, you know, robots dancing at like Chinese New Year, doing back flips and all these kinds of cool stuff. But on the other hand, you also see robots doing a lot of weird clunky things where they're running into mirrors and just causing a lot of havoc, right?

And sadly, this is what we consider the the robotics development gap, right? Essentially, what's happening is you're getting really really good performance in the lab, right? It can do all these crazy things, but when you actually deploy them into the real world, what you find is that the performance of these models drop very very severely, right? So why exactly does this deployment gap actually exist right?

So what you guys seeing on screen, I promise there won't be a lot of graphs today, but there like two graphs. This is the first one. Uh what you guys are seeing on the x-axis is just all the different types of scenarios, right? And this is just, you know, your training data, your testing data, and like your deployment data.

And on the y-axis is just like the number of scenarios in your training data, right? So when you go out, you know, you collect a whole bunch of training data, this is typically like what a distribution would look like. Uh, of course, this is simplified. And then you have your test distribution, right? So you have a training data set, you have your test data set, there's some overlap, but also some parts where they don't overlap.

And then when you actually deploy your robot, what you find is like the types of environments and all the different types of conditions that it actually encounters in the real world, it's actually very different from the things that happen in the lab. In the lab, everything is very clean, very organized, but in the real world, there's so much dynamic chaos. There's like people walking into the scene, there's reflection from mirrors, there's glare in the camera. All these are what we consider out of distribution scenarios and this is where robots fail, right?

So, you know, most people will say like, hey, let's just throw more data at it. Like, you know, the bitter pill lesson, just more data, it should be better. But the reality is a lot of the data that you actually collect from robotic systems, they're actually considered empty calories, right? Because they're not adding any new additional signal.

A lot of the times you're collecting the same scenario over and over and over. Think a self-driving car driving on a highway. You don't need more highway scenarios. What you need is more edge case scenarios. It's like a cow crossing a complicated intersection, a plastic bag that's right in front of the rear view mirror as you're backing into a car park, right?

These are the kinds of things that you actually want, right? So, in reality, when you want to be able to test these systems, you don't just need one small distribution or one small type of tests. You need to be able to go in and like get all these different types of distributions and cover as much of the scenarios as possible. So like every kind of lighting condition, every type of different um spatial layout of the scenario, right?

But getting this is really really hard and if you can do it, you can prevent uh failures from happening um in the field. This becomes extremely tricky because now we are entering the age of generalist policies. Robots that are promising the ability to do anything and everything. Everything from packing your dishwasher to folding your laundry to even doing things in medical, healthcare and science.

And now when you want to validate these systems, it becomes even more tricky. All right? So in the field we have a very simple uh way of like giving them like essentially like a reliability score. And this is the thing that the thing that most people care about when they think about deploying robots is what is my true reliability when I deploy these systems into the real world.

And reliability really just is like you can take a success rate which is if I do the task a 100 times, how many times am I getting it right? And you're also doing it across all the different scenarios that you want to be able to ship your robot for. Right? So if you're you want to be able to handle like a thousand different scenarios, you need to do that a thousand times a thousand and it scales very very quickly.

Right. And all these companies are now racing. They are racing towards how can I achieve reliability faster, faster than the competitors, faster than the market. And they are trying to figure out what's like the scaling law in reliability itself. Right? So the first way they test robots is pretty straightforward.

I'm sure you heard some talks where you know they will manually stage stuff. They will get humans, they'll get robots and they do everything in real time, right? They set up the scene manually and they actually get the robot to do the thing. But in this case, the number of scenarios that you can actually test for is bottlenecked by humans, robot, and time.

Right? So when we actually put that on a graph, this is a different graph, but on the bottom axis, you're seeing compute and the other axis you're seeing reliability. Every time you do an inference, you're spending some compute, but you're still bottlenecked by how many humans you have, how many robots you have, and how much real world time you have. As a result, you're still scaling uh linearly, right?

But then folks come along and like, okay, no, I'm just going to sample a few different test cases and I can get some additional new tests. It's good, but not great because you can't get a lot of distributions because they're still manually doing a lot of stuff. And then folks say, okay, you know what? If we remove humans from the evaluation cycle, right?

So now folks are using things like Gemini. Uh so Gemini robotics you can look at a scene and it can give you qualitative feedback on like hey uh did it actually complete the task successfully how far away is it and they also have things where you can autoreset the scene using another large uh vision language model or vision action model as well right so they've removed humans but you still have you're still bottlenecked by how many robots you have and how much time you need right so it becomes a bit slightly faster because now you can spend a little bit more compute and speed it up and you don't have to be reliant on humans as But you're still scaling linearly. All right. And what this means is you can just do slightly a few more tests.

And then of course, you know, like oh, you know, when we build bridges in the real world, we test it in simulation first and then we build the bridge and we do that all that um simulation for like mechanics and like tension and stuff. Why not do the same for robotics? So in robotics there's a thing called simto to rail gap which is when you do things in simulation they don't always line up with reality right and this is like a big problem that the industry is trying to solve and surprisingly enough in the last year we have a lot of new ways to solve this and the biggest one uh that we are working on is actually using the real world to generate the simulator itself right so what that actually looks like is you can take in real data right so you take in real data into the And you can generate things from that rail data and then you can reimulate the world from that. Right?

So this whole idea of you are generating a similar simulator specific for your domain and your thing every single time. Right? It's not just objects. You can generate entire worlds for your specific domain. For example, if you're like off-road self-driving car and you're operating in the California desert, you can very quickly generate that entire world and train in that simulation.

Right? So this is how you begin to close that sim to real gap. And what this allows you to do is it allows you to copy the distribution of your actual test set and have a simulated version of it. And this is already valuable because you can now do closed loop testing with this distribution. But how do we go even further?

Right? This is is not great coverage. Right? So let's just take one specific scenario. So like this is an example of the type of data we generate. You know, here it's like a boat is driving to a crowded marina. The glare is in the screen, uh, is in the camera and everything's a bit chaotic, right? But this is just one specific scenario.

How do you scale this up to more scenarios, right? So, what we can actually do is we can go into the simulator and we can parameter sweep across all the different operational conditions and it's almost as if you're seeing a thousand different realities very, very quickly and you're testing the model against all these different realities. Uh simultaneously, right? And from there you can expand it even further, right?

So it's not just um a n* n test. You can scale it up to a lot of different domains and criteria. And the cool part about this is that you can test your AI model against it and you can immediately see where your AI model will be failing even before you have shipped your robot into production. And the whole idea here is simple, right?

Fail fast in simulation and use those failures and direct them for real world testing. So you're not testing on every single thing, but you're testing very specifically on the places where you failed in simulation. This way you spend less capital, you're more optimized and efficient with the resources that you do have as well. And you know, we're just limited by real world time as well, right?

Right? So we go from this to this because now we can cover a much much wider domain. And there's a term called like domain randomization, but basically you're covering a much wider domain than real data could ever possibly cover. And it's a very good way um to do these tests. So you know, everyone, I'm sure, would have seen this thing called like the data flywheel.

It has become a meme at this point where every company's like, "Yes, we have a data flywheel." But a flywheel doesn't actually capture the most important thing. And the most important thing is you actually need to refine this data. The data needs to be super high quality. You need to figure out a way where you're finding the most valuable things and you're also being able to drive what you should collect in the real world as well.

Right? At Bifrost, we help some of the world's most demanding customers do this at scale. And we are essentially taking all of this and we are simulating it in your browser. So we have a world, you can simulate the world and you can break your AI model inside of it. Thank you folks. That was an amazing talk, especially talking about like a data refinery.

It's trying to trying to make sure that your data like covers all the different edge cases. So, I'm really excited next to have Julia Kim from Open Graph Labs talking about how they built an in-house stack where they've ensured that you can sync the data collection across the mount different multimodalities. And this is really difficult because even microsconds of of drift when you're collecting data for training robots can end up being really damaging when you actually take this and you train your models. So really excited to see how that goes.

Uh while we sort of have a bunch of these uh technical difficulties, you know, I'm wondering like how have you guys been finding like today's conference? You know, I think personally for me it's been like absolutely like stunning. Like I was really blown away just now when like Justin demoed the ability to paint just using like a brain control interface. Like I never thought that was possible because I've only been playing a lot with agents, right?

I see like text in text out like oh my god like we're going to just everyone's jobs are going to be automated away. And it's really cool and inspiring to see like AI being used for good. And so I think like that's been something that's been really exciting to see just the sheer diversity of like opinions and projects that people are like working in. I think so. &gt;&gt; Um, ourselves as a team uh using a lot of the tools that uh the speakers and sponsors have built.

Um, so we'll take that as a note. Oh, okay. I think we're back. Yeah. &gt;&gt; Drag it. Yeah, it's extend. So, &gt;&gt; we had it just now. &gt;&gt; Oh, it's back. It's back. &gt;&gt; Yeah. Okay. Good. &gt;&gt; Thank you. &gt;&gt; Hi. Good afternoon, everyone. &gt;&gt; Good afternoon, everyone. Uh, my name is

Julia Kim (OpenGraph Labs): Training Humanoids by Sensorizing Humans — The Missing Touch Modality

Julia Kim, CEO & Co-Founder, OpenGraph Labs — building the hardware/software standard for tactile data collection so robots can train on the full human sensorimotor loop.

Summary

  • Frame: hundreds of companies are now paying humans to film first-person 'egocentric' video for robot training — Indian factory workers wearing head-mounted cameras at scale, similar setups across CNN-documented programs, all selling the data to humanoid training pipelines.
  • Why now: NVIDIA's recent EgoScale research proved egocentric human video works as humanoid pre-training data — fine-tuning on a small human-robot alignment set plus a few teleop demos got a robot to one-shot-fold a shirt. A scaling law for physical AI, mirroring LLM scaling.
  • The structural argument: humanoids have closed the embodiment gap to humans (same form factor, similar DOF), so human behavioral data is now the most direct supervisory signal possible — and it is captured in the continuous, dynamic, physically grounded real world, which is higher-fidelity than any simulator.
  • Two futures fork: 'robots as utility' (warehouses, factories — don't share our space) vs 'robots as companion' (fold laundry, help your parents, hand you a glass of water — share our world). The companion future requires babies-style sensorimotor learning: vision + touch + proprioception + audio + action/feedback loops.
  • The thesis: sensorize the humans. Vision (egocentric cameras), motion + 3D hand/wrist pose + body trajectories (reconstructible from video), proprioception (from video), audio (camera-native) are already covered. One critical modality is missing at scale — TOUCH.
  • Why touch is missing: every other modality piggybacked on the standardization of the RGB camera and decades of compounding hardware/CV ecosystem investment. Touch has never had its 'camera moment' — no unified hardware stack, no shared data format, no scalable collection rig.
  • OpenGraph Labs' wedge: two coupled layers — (1) a wearable tactile-capture glove producing high-fidelity fingertip contact signals, claimed 7x cheaper than current options with consistent sensing signals; (2) a tactile encoder model trained on the scaled dataset that ingests raw tactile signals and converts them into meaning humanoid policies can train on.
  • Pitch: be the standardized substrate for touch — the equivalent of 'RGB camera + ImageNet' for tactile data. Thousands of people, millions of contact events, digitized, ready to feed the next generation of humanoid models.
  • Closing line: 'Let's train the humanoids by sensorizing the humans.'

Slides (15)

Slide notes

  • [t=6:10:11] Slide 1: Title — 'Have you heard about Egocentric data?' / Julia Kim, CEO & Co-Founder, OpenGraph Labs / 'Training Humanoids with Sensorized Humans'.
  • [t=6:10:35] Slide 2: Same title overlaid with a viral social-media screenshot: 'Indian workers are shown wearing head-mounted cameras to record egocentric data, a verified practice used to train embodied AI and robotic models... industry trends point to these recordings being used for robotic imitation learning.' Comment thread shows 6.4K likes, ~36 replies — establishes that this is happening at scale and is public knowledge.
  • [t=6:11:10] Slide 3: Three-panel proof — (left) CNN-clipped TikTok 'Thousands of workers around the world are helping develop humanoid robots by filming themselves doing household chores. CNN's Stephanie Hare reports' — caption 'are stripping romance'; (center) two stills of egocentric hand-pose / fruit-cutting from CNN; (right) grid of first-person robotic-imitation data clips. Source: cnn.
  • [t=6:12:00] Slide 4: 'Why did human egocentric video become the dataset for robotics?' — pivot slide.
  • [t=6:13:30] Slide 5: 'Future we believe...' — two-panel choice: 'Robots as utility — Machines in warehouses, on factory floors, doing repetitive tasks' vs 'Robots as companion — Robots that live alongside us, share our homes, work next to us, understand the world the way we do'. Argues the companion future is what justifies the data investment.
  • [t=6:14:40] Slide 6: 'When we were babies...' — image of a toddler grasping/dropping/mouthing objects on a play mat. Setup for the sensorimotor-learning analogy.
  • [t=6:15:15] Slide 7: 'We learned from action & feedback' — same baby overlaid with a circular diagram: 'THE SENSORIMOTOR LOOP — LEARNING THROUGH INTERACTION' with arrows for SENSORY INPUT (vision, touch, proprioception), ACTION, FEEDBACK. The core argument: physical intelligence comes from closed-loop sensorimotor learning, not from passive observation.
  • [t=6:16:30] Slide 8: 'What's blocking touch?' — left panel shows existing capture stack: Vision | Motion | Proprioception | Audio with images of an egocentric camera rig, an iPhone, and an RGB colour wheel. Right panel: 'Touch — ?? no existing tool'. The thesis in one slide.
  • [t=6:17:30] Slide 9: 'We need the camera moment for TOUCH' — three bullet criteria: Highly scalable / Consistent signals / Standardized outputs. The whole pitch is 'be the RGB camera for tactile.'
  • [t=6:18:20] Slide 10: 'OpenGraph Labs is building the standard for TOUCH' — backed by a stylized humanoid image. Brand-positioning slide.
  • [t=6:19:00] Slide 11: 'We enable this by building hardware and software together' — two-panel product map: (left) 'Highly scalable hardware — Wearable tactile capture glove / x7 cheaper than current options / Consistent sensing signals' with image of beige glove and a heat-map readout of fingertip contact; (right) 'Tactile encoder — A model trained on our scaled tactile dataset / Ingesting raw tactile signals and turning them into meaning robots can learn from' with a screenshot of the dataset-capture rig. The full bull case in one slide.
  • [t=6:19:35] Slide 12: Speaker-cam — next speaker (Suveen, Cortex AI) walking on, Julia exiting. Transition frame.
  • [t=6:19:42] Slide 13: Stage-wide cam, AIE Singapore logo, transition between talks.
  • [t=6:19:48] Slide 14: Cortex AI demo reel begins — 4-panel grid of autonomous robotic arms doing pick-and-place in what looks like a convenience-store aisle, labeled with autonomy + speed markers. Sets up the next talk on robotics data ops.
  • [t=6:19:55] Slide 15: Next speaker's title card — 'Full-Stack Robotics: Lessons from Building Across the Robotics Stack / Suveen Ellawela / Founding Engineer @ Cortex AI'.

Full transcript (1429 words)

Julia, co-founder and co CEO of Open Grab Labs. Uh today I want to talk about how our everyday human experiences can actually become useful training data for next humanoids. So how many of you have heard the term egocentric data? Yeah, I can see a few or maybe you've seen this fire video recently at apps.

Factory workers are wearing the cameras on the hat uh while they're working. So over the last year something very strange has been happening in the field. Hundreds of companies have started collecting the human behavior data at scale. People filming their first point of view um cameras doing their daily task and actually got incentivized for doing that.

So why are we doing this? So why did humans suddenly become the core data sets for robotics? So this is because we just got the proof that it actually works. Nvidia's recent ego scale research show that scaling human egocentric data actually helps the robot training. So they do use the egocentric video as a pre-training pre-training data set for their model and fine-tune on a human rob alignment data set also with a few teleoped uh robot only data and the robot can actually do the task like folding a shirt with a one one shot transfer and as the same way the language model u scaled with a with a putting more data they also show that uh it's also can be workable for AI physical AI too.

So it showed a significant scaling low not just because it was proved to be useful for pre-training but actually to be honest the egocentric human videos are fundamentally very important with two aspects. First, we are now building the human level capable robots. That means that same form factor they looks like us and similar degrees of the freedoms and that means that we are trying to minimize the embodiment gap between the human and the humanoids and is actually actually getting closed very fast and as the gap go as the gap closes the human behavior actually can be directly transportable to the robot that which is the most direct super visa signal uh possible in the world and secondly the egocentric data is captured in the real world as it actually is. The physical world as we all know is continuous uh it's dynamic and physically grounded.

So every data we got from the egocentric data is actually very very high fidelity data uh and it includes the much more information that any robot could ever learn from. But then uh are we really done now? Uh so we can just have more egocentric video data and we can solve more all the problem. Uh so simply collecting enough human video data there is some bad will robot eventually achieve the human level physical intelligence or not.

Well I do think that this actually depends on which future you are building towards and that future defines the level of intelligence we might need for robots. So one future is robot as a utility. So tools in the warehouses, arms in the factories, machines that do the task, but they don't share the space with us. And the other future robots that actually live with us, they fold our laundry at our home and that also help our parents to to companion our parents and they hand us the glass of water.

Uh and which means that they actually share our world and if we want them to live with us they need to be physically intelligent. So they need to learn the word the same way how we did. So then let's go back to something very fundamental. How uh think about how did we actually first learn the words when we were babies.

We grasp the things, press the things and drop the things and touch the things, pull the things sometimes or actually many times we actually put something in our mouth to taste it. We learn the word by interacting with it and we learn through the actions and feedbacks through the touching the word and observing how it actually responds after my actions and this is what we call the sensory motor learning. So the nature question uh follows that if human sensory motor learning itself is what forms our physical intelligence then what if we could do the same thing to robots. We let the robot learn the same way that we learn as a babies.

So again this is the same baby from the last slides uh is actually producing and generating all of these sensory motor signals at once. Vision, touch, propriception, audio, action and feedback loops. And through those interactions, the baby gradually learns the structure of the physical words. So the question becomes now if we could capture all of these data and train and use as a training data set for robotics, we can make we can let the robot exactly mimic like us uh and learning everything on top of here.

And yes, I truly believe in that future and we can achieve this by sensorizing the humans. Today, many parts of the human sensory motor loop are already becoming very measurable. We already have the vision system captured through the egocentric cameras. We can also reconstruct the motion information directly from the video and also pro pre propoception like a 3D hand pose, wrist pose and uh trajectories body motion reconstructions those are also can be reconstructed from the video and audio is also very naturally captured through the camera system.

So and so now we it's very um very obvious that to see that only one critical modality uh now we are missing largely is the touch and for physical interaction as we all know touch may be the most important signals that we should collect from the real world. So one reason we still have very little touch data today is that many other human signals are already capturable and interpretable from firsterson vision alone. So the egocentric video. So with egocentric cameras we can already infer motion trajectories, hand pose, body movement, action structure and even proper obsession.

And honestly, this is probably the moment to thank decades of the progress in the camera hardware system and the entire ecosystem built on top of the standardized RGB system because once the world convers around the RGB cameras, computer vision became scalable and now we are right now waiting for that exact moment for the touch because the touch never had that moment yet. So we have to follow how the video system improved how VA scaled because they were converged around the one thing the camera and the RGB pixels. We also need a unified hardware stack that everyone could build on on for touch data and also build a data infrastructure which which share the same format of the data and this is why we exist. Open grab labs here is here to build the standard for touch the missing piece of the sensory motor system so that with this and we we can get finally leap forward in robot learning.

We enable this with two main layers. First the highly scalable hardware that produces the high fidelity contact signals from the fingertips and secondly on tactile encoder which is an interpreters built on top of that hardware ingesting the tactile signals and turning them into meanings. So with a high scalable hardware we can a we are able to capture scalable data set and on those data set we are now able to build a meaningful encoder tactile encoder and so we've just started building a complete pipeline for capturing the full human sensory model loop uh making it trainable for the first time. Thousands of people, millions of interaction, every moment of contact between the human and the physical interaction can be now captured, digitized, and ready to be teached for the next generation of the robotics.

Let's train the human noise by sensorizing the humans. Thank you. That was sick talking about scaling out human data collection for touch. Now a huge part of actually collecting data is that we actually need to scale up the data operations, right? We don't just need to collect the data. We need to ensure that we have the operators, we have that entire infrastructure and logistics handle.

And so we have Suen from Cortex where they talk a lot about how they do this at scale with robotic and other forms of data.

Full-Stack Robotics: Lessons from Building Across the Robotics Stack

Suveen Ellawela, Founding Engineer at Cortex AI (Singapore robotics startup building bimanual + mobile manipulation; partnered with Allen Institute on MolmoAct 2)

Summary

  • Cortex AI is a Singapore robotics shop running bimanual YAM arms on manipulation tasks (folding towels, pouring milk, retail picking) and mobile robots in convenience-store environments — policies that take pixels-plus-joints in and emit actions in a tight observe-act loop with top + wrist cameras.
  • Frames the modern robot learning stack as six layers: data, software, hardware, models, evaluation, operations — and his core claim is that robotics is not primarily a model problem, it's a data and operations problem.
  • Cortex co-produced the MolmoAct 2 dataset with Allen Institute for AI: 720 hours of bimanual YAM teleoperation data, the largest open bimanual dataset to date — pure data-layer moat, not model IP.
  • Data pipeline lessons from scaling teleop to hundreds of hours: decoupling encoding from collection (move encoding to session-end, then hardware-disconnect so a new session runs while prior episodes encode/upload) compounded throughput; high-quality data 'starts before recording' — task strategy, motion practice, cross-operator consistency all gate dataset quality.
  • 'Breadth matters': problem layer and solution layer rarely match — broken gripper looked like a model/code bug but was solved by designing custom hardware; tilted-camera data loss looked like a hardware bug but was solved by writing a pre-session software tool that validates camera view before recording.
  • Real-world evaluation is the hard part — sim is parallelizable but real-world has lighting drift, distractors, actuator/camera noise. For MolmoAct 2 they ran thousands of real-world rollouts across five policies. Every rollout has a manual reset cost (no parallelism), and failure attribution is a nightmare: data inconsistency, training setup (LoRA vs full FT), partially-initialized weights, wrong action-chunk size, or eval-setup drift can all look the same from the outside.
  • Safety during development (not just deployment) is underrated: leader-arm dies dropping full weight on the data operator, robots smashing test tubes with glass shrapnel flying, stale action queues causing sudden 90°-in-half-a-second arm movements.
  • AI-written code on robots discipline ('Run it like it can fail'): they use LMs heavily to scaffold around LeRobot (HuggingFace) for new arm integrations, but layer offline checks (types, unit tests, business logic) → replay/sim → log-only execution (send actions, don't actuate) → single-joint slow-speed actuation on real hardware before full run.
  • Closing takeaway: you don't have to be expert in every layer, but cross-layer fluency is what lets you debug and ship faster — Cortex is hiring full-stack roboticists (cortexrobot.ai/careers).

Slides (21)

Slide notes

  • [t=6:19:44] Slide 1 — Title: Suveen Ellawela, Founding Engineer, Cortex AI. 'Full-Stack Robotics: Lessons from Building Across the Robotics Stack.' Grid of robot clips: bimanual arm autonomous (2x/4x speed), convenience-store teleop (2x), bimanual arm with mug and cup (2x), folding green towel (4x). Establishes the product: bimanual manipulation + mobile retail robots.
  • [t=6:20:20] Slide 2 — 'How did these robots get so smart?' Single image: bimanual arm pouring last drop of milk into red mug. Setup question for the architecture diagram.
  • [t=6:20:55] Slide 3 — 'From pixels to actions.' Observations panel: top camera + left wrist + right wrist feeding into Model box. The architectural primitive: pixels-plus-joints in, actions out, top + wrist camera convention.
  • [t=6:21:30] Slide 4 — Layer list (vertical): data / software / hardware / models / evaluation / operations. The full-stack frame: six layers Cortex thinks across.
  • [t=6:22:00] Slide 5 — 'Robotics is also a data and operations problem.' Centered text slide. The thesis statement of the whole talk.
  • [t=6:22:25] Slide 6 — '720h MolmoAct2 Bimanual YAM dataset / The largest open bimanual dataset to date.' Grid of teleop scenes + HuggingFace card screenshot + QR. Citation: https://huggingface.co/collections/allenai/molmoact2-bimanual-yam-dataset. The data-moat receipt.
  • [t=6:23:00] Slide 7 — Teleoperation demo image: human operator wearing leader arms, robot follower arms in front mirroring on a pink mat. 'Even though it looks fun it's actually very hard' — human intuition doesn't transfer to new embodiments.
  • [t=6:23:45] Slide 8 — 'High-quality data starts before recording.' Image: bimanual arms folding a green towel. Task strategy + motion practice + cross-operator consistency gate dataset quality before a single frame is captured.
  • [t=6:24:30] Slide 9 — (Workflow optimization, audio-only). Story of decoupling encoding from collection: 2-3 min per-episode encoding → end-of-session batch encoding (30-40 min wait) → hardware-disconnect code change → fully decoupled collection / encoding / upload pipelines.
  • [t=6:25:30] Slide 10 — 'Breaking Grippers.' Image: bimanual arm reaching into a basket, gripper component on floor. Setup for 'problem layer ≠ solution layer.' Looked like a software/model bug (lower gripper force in code); actual fix was hardware (design custom gripper).
  • [t=6:26:30] Slide 11 — 'Moving Cameras.' Painted-wall scene with tilted top-camera circled. Inverse case: looked like hardware (rigidify the mount); actual fix was software (pre-session camera-view validation script).
  • [t=6:26:55] Slide 12 — 'Moving Cameras' (continued). Right panel: live-vs-reference comparison tool with 'Live: 100%, Reference: 0%, FPS 30.0' — the camera-view-validation tooling Suveen built.
  • [t=6:27:25] Slide 13 — 'Evaluating Robots.' Simulation (monitor icon) vs Real-world (robot-arm icon). Sponsor strip: Diamond Sponsors / OpenAI. Sim is parallelizable; real-world is where it gets messy.
  • [t=6:27:55] Slide 14 — Single bimanual arm on desk with mug + glass cup, ready for a rollout. Lead-in to thousands-of-real-world-rollouts story for MolmoAct 2 across five policies.
  • [t=6:28:25] Slide 15 — Three-panel real-world rollout sequence (cluttered desk with mug, glass cup, plastic-wrapped object). Visual of one rollout in progress — the kind of run that has to be reset by hand if it fails.
  • [t=6:29:05] Slide 16 — 'Every rollout has a reset cost.' 9-panel grid of identical bimanual-arm-with-red-block scenes; center panel shows a human hand reaching in to reset. The single most quotable image of the talk — no parallelism for real-world eval.
  • [t=6:29:50] Slide 17 — Mobile bimanual robot on wheeled base / lift column. Bridging to safety: leader-arm-dies-dropping-weight, broken test tubes / glass shrapnel, stale action queues causing 90°-in-half-a-second movements.
  • [t=6:30:30] Slide 18 — 'Running AI-written code on Robots.' Image: research scene with researchers + robot dog + green ball on artificial turf, captioned anthropic.com/research/project-fetch-robot-dog. Setup for the LM-scaffolding-around-LeRobot discussion.
  • [t=6:31:00] Slide 19 — 'Run it like it can fail' (title only). Setup card.
  • [t=6:31:25] Slide 20 — 'Run it like it can fail.' 1: Offline checks — type checks, unit tests, manually check if business logic is correct. 2: Replay/Simulation — test on logs or in simulation when possible. Terminal screenshot of joint-position log lines on the right. (Talk implies a 3rd/4th stage: log-only on robot, then slow single-joint actuation.)
  • [t=6:32:00] Slide 21 — 'Biggest takeaway' (with partial bleed-through showing 'speed/move'/'operator stop' from the prior safety slide). Closing: cross-layer fluency beats single-layer depth for shipping speed.
  • [t=6:32:30] Slide 22 (Thank-you) — 'We're hiring people who like working across the robotics stack. cortexrobot.ai/careers. Thank you!' Two QR codes: cortexrobot.ai and X @suveene. Right: teleop image of operator on mat with bimanual arms (Cortex-branded mat). Hiring pitch + handles.

Full transcript (1714 words)

Hi everyone, I'm Suin. I'm from Cortex AI and I'm a founding engineer there. Today I'll be speaking about some of the cool things we got these robots to do, some of the challenges we faced and some of the lessons we learned. Here you can see some of the robots that we work with. We mainly work with bmanual robots doing manipulation tasks and we also work with mobile robots doing uh task in more realistic environments like convenience stores.

And you might wonder how these robots got so smart. Even in this clip you can see it's pouring the last drop of milk to the cup. Actually this learning systems they just take pixels in and they output actions. Usually we have a top camera and wrist cameras. We also passed in the joints data of the robot.

A simple language instruction. Then the model will predict some actions. We execute actions on the robot. You go to the next state and the loop continues. This diagram is actually a really good way to think about the modern robot learning stack. You have camera beams and joints as data. Software is powering uh data collection, training, inference.

Hardware is arms and cameras. Models is models are what policies that we run and to test if these policies are working you need evaluation and to make this happen again and again you need a good operations layer. Robotics is often regarded as a hardware problem or a soft or or a model problem but it is also a huge data and operations problem. Recently we worked with Alen Institute for AI on their Mulmo act 2 paper and we collected over 700 hours of bmanual yam data for their data set and it is the largest open bmanual data set to date and we collect our data through teleoperation.

Here you can see my colleague he's controlling what we call lead arms and the follower arms will copy the motion and even though it looks fun it's actually very hard. The main reason is human intuition. It does not transfer really well to a new embodiment. You know how to grab a cup with your hand. But not when you have to think through a robot arm, it's really hard.

But it's it's a learnable skill. And not just that, there's a lot more to be done before you start collecting data. Even the simplest task of folding a towel, you can fold it in two, you can fold it in three. You have to you have to come up with a task strategy. After that you have to practice the motion.

Then you have to make sure the data collected is consistent across episodes and across different operators as well. When we started scaling these data operations to hundreds of hours, we realized some small workflow changes we added. They started to compound. Initially we had the friction of waiting for two or three minutes for each episode to be encoded.

Then we move the encoding process to the end of the session. Then suddenly the whole encoding duration is much longer. Now we had to wait for good 30 40 minutes before we start the next session. Then what we did was we made a small code change. We disconnected all the hardware. So you can run a new session while the previous episodes have been encoded.

And what ended up happening was data collection and encoding and uploading processes they became completely decoupled. Another thing I want to talk about is how breath matters in robotics. By breath, what I mean is being knowledgeable across different layers in the stack and being able to operate up and down in the robotic stack. The reason I'm saying is I've realized the problem space and the solution space might not be in the same layer in mo most of the times and the more intuitiveness you have across the layers, it's much easier for you to solve problems faster.

Let me explain this with a few examples. So when we started running policies on these robot arms, there was a task where the robot had to grab a jar and the grippers broke and you can see the clip the grippers flying off. And I thought, okay, maybe the model learned something wrong or I could just lower the gripper's force from code. But my colleague, he said, okay, let's just design our own gripper.

We were we were working with third party hardware, but we could still innovate on top of that. And this is a good example of a hardware solution for a problem which I thought is in software. Similarly, whoever like worked with these cameras, you know, they get disconnected often and you unplug it in unplug it, then plug it back in, then it starts working magically. Then one of these times, one of our operators tilted the camera accidentally and the top camera view was off.

So the all the data we collected that day we had to throw away because it was not in the correct view and we were trying to make sure the camera mount is more rigid but I coded this I w coded a script a tool to check if the top camera view is good. So what we did was every session we take two or three minutes at the start then we check if the camera view is correct then we can make sure the data we collect is actually good. So this is a good example of a software solution for for a problem that we thought is in hardware that that's why moving across the stack and thinking from all these layers actually helps a lot. I also want to talk about why evaluations in robotics is hard.

Similar to software you can eval evaluate robots in simulation and you can paralyze that. But real world is where things get messy. For example, lighting could change. There could be distractors, there could be actuator and camera noise. So you have to account for all those things. Recently when we worked with Malm act when we work on malmarmac 2 we ran thousands of real world evaluation rollouts across five policies and that taught us like how hard of a problem this is.

So when you run real world evaluations, this could happen when it's a failure and it would happen again. In robotics after you run every roll out, you have to reset the environment manually. Not like in software you can run parallelly. You have to manually go and clean it up if it makes makes a mess. And I've done this hundreds of times and I can guarantee you it's not fun.

Uh then we realize it's very expensive to do this all the time but that is the gold gold standard as of now. Another hard thing about evaluation is when a robot fails to do something it's really hard to figure out where it fails. Let me let me explain with a few examples. It could be the data. Maybe different operators use different strategies.

Maybe I folded it in two. Someone else folded the towel in three. Uh maybe it's a training setup. You the adaptation you wanted for example Laura versus full fine-tuning then it could be the setup I've had scenarios where I tried to load a model and some part of the model got initialized with random weights and the model is like going haywire and it could be the wrong action chunk size as well compared to the what the size that you used in training and maybe the evaluation setup itself could be wrong maybe you are trying to evaluate in distribution but the placement of the object object is slightly off.

Lastly, I want to talk about safety. This clip is something I accidentally recorded. You can see the joints doing a 90° in less than half a second. And if someone else's hands were there, they would have gotten hurt. We really talk a lot about robot safety when it's deployed, but I think there's a lot lot of safety concerns when it's developing as well.

Uh I can say like in data collection if the leader arm suddenly dies which happens sometimes the whole weight might be on the data operator. In evaluations we had cases where we are testing a task which involves test tubes and one of the robots they broke the test tube and you have like glass pieces going around u and stale action cues that might lead to sudden arm movements which is also a safety concern. And there's much more like this. I also want to talk about running AI written code on robots because especially AI coding tools are becoming mainstream.

Uh to give context one scenario that we use uh AI coding tools for robots is basically when we are using lay robot we are huge fan of layer robot from hugging face. So when we want to adapt that library to robot arms that we work with there's a lot of scaffolding a lot of interface work that we need to be done. So we use AI to do that and move faster. But when we try to run it, we run it like it can fail.

Uh we do the normal software checks, fundamentals, normal PR reviews. Then we try to check in simulation and we try to test in logs. You can send the actions to the robot but not execute them. Just look at the logs first. Then when you want to test it on the actual robot, you can just move one joint at a time.

You can slow the speed down. Uh yeah, these are some of the things that we follow. Yeah, one thing I want to emphasize is that you don't have to be an expert in every layer of the stack, but if you have more knowledge about different layers, it's really easy to solve problems and move faster. That's it from me.

Thank you. Okay everybody, um that is the conclusion. Um thank you Savine by the way. Thank you so much. Um this is the

Keziah: guided sensory meditation with a vibe-coded particle visualizer

Keziah (phonetic 'Kazaya' in the MC intro — Singapore-based consultant by day, mindfulness coach on the side; no prior coding background, vibe-coded the on-stage particle visualizer in ~4 weeks using off-the-shelf GitHub repos)

Summary

  • Programmed as the Day 2 mid/late-afternoon break replacement: instead of the standard 'AIE logo + lobby music' interlude, organisers (MC Sherry introducing) handed ~15 minutes of stage time to a guided sensory meditation paired with a custom particle visualizer projected behind the speaker.
  • MC framing was load-bearing for the segment's meaning: 'we wanted to put thought into every minute of the programming... and this is very much in the spirit of AI engineer and this changing definition of what a builder and engineer is' — explicit positioning of the vibe-coder-with-no-CS-background as the new default 'engineer'.
  • Builder story: Keziah is a consultant + mindfulness coach who wanted a systematic way to bring meditation into high-overwhelm settings; she had no coding background, was 'AI-pilled' by the AIE SG circle, then went and assembled a particle visualizer from GitHub repos and stitched it into a live guided-meditation experience in roughly four weeks.
  • Stage delivery: lights down, ambient/chant audio loop ('hey, hey, hey...' captured in the transcript is the meditation soundtrack, not stage banter), particle field running on the back screen, soft pink illuminated balloon-like sensory props placed on stage and at the front of the audience (visible in the wide shots). The talk is ~90% experiential — only ~880 transcribed words across 27 minutes, the bulk of which is the MC intro and the post-meditation handoff to Jay Demetillo.
  • No technical deep-dive was given on stage. Stack is inferred from the visual (browser-rendered particle system, likely Three.js / WebGL or p5.js fork from a public GitHub repo, driven by audio-reactive parameters) — the artifact is the experience, not a slide on architecture.
  • Tail of the segment (transcript minutes 19-27) is MC banter and stalling while the next speaker Jay Demetillo (ex-Canva) sets up his terminal-prompt title slide — useful context that this is the conference's late-afternoon energy-management slot, not a featured tech talk.

Slides (8)

Slide notes

  • Segment runs from t=23415s (6:30:15) to t=25056s (6:57:36), 27.4 minutes total. The visual artifact is the live particle visualizer + sensory installation, not a slide deck — only 8 frames were extracted and most are stage/speaker-cam captures rather than info slides. The transcript-light, image-heavy nature of the talk is itself the documentation. Frames as captured: [t~6:30:30 / slide 01] Wide stage shot: AIE SG city-skyline backdrop, podium with 'AI Engineer Singapore' branding, three figures at the lectern (MC + crew + Keziah in dark green/teal coat), a soft glowing pink balloon-like sensory prop on the floor downstage right. The sensory props are the physical analogue to the digital particle visualizer. [t~6:32:00 / slide 02] Mid-shot of the MC (woman, dark jacket, lanyard) on mic centre-stage with crew flanking her. This is during the intro framing: 'Kazaya is someone who has a day job just like the rest of us working in consulting but she is also a mindfulness coach...' — the slide showing here is the MC introducing the builder story, not Keziah herself yet. [t~6:33:30 / slide 03] Three figures around a closed MacBook on the podium — Keziah (centre, dark green coat, headphones around neck) flanked by two crew members in headsets — setup/cueing the visualizer right before going live. The 'AI Engineer Singapore — Organised by 65 Labs' brand panel is visible. This is the operational moment of a no-code vibe-coder cueing her own custom-built stack on a major-conference main stage. [t~6:35:00 / slide 04] Speaker-cam of Keziah on mic, podium-side, smiling, beginning the guided meditation. The Singapore-skyline backdrop is visible behind her. This is the segment's hero frame — the talk's centre-of-gravity is her on stage, the visualizer running behind, no slide deck. [t~6:40:00 / slide 05] Wide stage shot mid-meditation: AI Engineer Singapore wordmark projected centre-stage on otherwise dark screen, giant illuminated white 'AIE' letters lit on stage-right, Keziah on mic on a small downstage circle, four glowing pink sensory balloons on the floor in a row across the stage. Two crew silhouettes at stage left. This is the closest the segment has to a 'thesis slide' — minimal type, maximal stage-as-instrument. [t~6:48:00 / slide 06] Speaker-cam of the MC (man, dark jacket, cap, mic in hand) doing the post-meditation handoff to the next speaker Jay Demetillo. Singapore skyline backdrop, giant 'AIE' letters visible at stage-right edge. Transition into normal conference programming. [t~6:52:00 / slide 07] CONTENT SLIDE — the next speaker Jay Demetillo's title screen projected behind the MC: 'AIE_Singapore_May_17_2026 / JayDemetillo@TBA ~ %' rendered as a terminal prompt with a blinking cursor. Note: this is technically the next segment's opening frame but was captured in the seg_27 extraction tail. Not part of Keziah's content. [t~6:56:30 / slide 08] Speaker-cam of Jay Demetillo (blue shirt, glasses, mic) walking onto stage — fully into segment 28 territory. The Keziah segment has effectively ended by this frame. Net: no info slides exist for the Keziah talk.

Full transcript (884 words)

Stay for that. Um and while while this break is happening um I want to give a little bit of context to my friend there in green called Kazaya. Um just wave. Uh so Kazaya is someone who uh you know has a day job just like the rest of us working in consulting but she is also a mindfulness coach and wants to find a way to be able to bring more people into that kind of practice especially in spaces where there's just so much going on that a lot of us can feel things like overwhelm anxiety and just want to find a systematic way to be able to take a pause and just be able to kind of you know slow things down a little bit.

And that is why we wanted to create a little bit more of a curated experience for the breaks versus, you know, putting up AIE logo and some music and let you guys have coffee, right? Um, we wanted to put like thought into every minute of the programming. So, um, that's why we brought her on. But I also wanted to share another story about how this all got started because I think it's very much in the spirit of AI engineer and this like changing definition of what a builder and engineer is.

Uh Kazaya actually with no background in coding actually vibecoded this entire experience. She found GitHub repositories that helped create the particle visualizer on the screen that you're going to see and she did that all in the last four weeks. I mean I think you know we kind of pilled her on AI and then she just like went and you know went ahead and decided to um to build it. So, um I couldn't be more happier seeing people who are in all different kinds of, you know, spaces, uh industries, like being able to be empowered with these tools and just create these amazing things, right?

And um all of this is possible to be able to connect things like uh meditation and mindfulness to an actual tech experience that we get to be able to show on stage today. So, um do kind of enjoy uh the next uh 15 minutes or so um you know to kind of slow things down and uh you know get that little less reserve of energy before we finish off the day. Thank you. Hey, hey, hey.

Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, Hey, hey, hey. Hey, hey, hey.

Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, Hey, hey, hey. Hey everybody. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey. Hey, hey, hey, hey. Hey, hey, hey. Hey, hey, hey, hey. Hey, hey, hey. With our programming really quick.

Uh and and while our next speaker Jay sets up, uh it's going to be an incredible talk. I've been looking forward to it. I spoke to Jay a little bit behind the scenes and this is a talk about humans about the human side of AI where you if you may work in a team and people are wanting to get in on AI, they're wanting to uh level up, they're wanting to design for other humans, often times we we end up with generic prompts and generic results and and we may not even know how to really get the most out of it. And so this talk from Jay uh comes to us previously from Canva Excanva is going to talk to us about that and I'm very excited about it.

So, if you're ready, if you're feeling restored, if he's Is he ready? He's not yet ready. No. He He answered me. He just said no. That's so nice. I can kind of see no. While While they get ready, how are you feeling? You good? Show me by applause level. Okay, that's nice. Very good. That's good. I'm happy. It's It's a good conference.

And it's sad that it's almost coming to an end. If you want more even after the end to say a You don't want more? I don't think we won't do it next year. How's that? I'm joking. I'm joking. It's fine. They're freaking out backstage. Can he say that? I don't know. We gave him a mic. Um, this is going to take a while, huh?

This is the worst part about being an MC. Now I've got to like think about entertaining you all. But that's easy, right? &gt;&gt; Hey, thanks. He said I'm What's your name, sir? Ari &gt;&gt; art. &gt;&gt; Test test. &gt;&gt; His name is Art. Jesus. The guy's a piece of art. You ready? &gt;&gt; I think we're ready, man.

Your biggest round of applause. Let's go, baby. What's going on? Wake up. Come on, baby.

Prompts Don't Have Opinions. You Do. — Prompt Fatigue, Human Context, and the Design Judgment AI Cannot Replace

Jay Demetillo, Principal Designer, Ex-Canva (ex-Grab) — shipped Canva Sheets and column/layout interactions to millions; talks under the banner 'Prompts Don't Have Opinions. You Do.'

Summary

  • Opens with a Jon Snow jab at 'design influencers who haven't shipped to millions' — frames the talk as an anti-bullshit corrective to prompt-tip-thread culture (his term: prompt fatigue / 'tokenmaxing slop').
  • Anchor analogy: General Mills 1947 cake-mix flopped when it required only water; sales took off in 1949 when bakers had to add an egg. The IKEA Effect — co-creation lifts perceived value (cites a study showing 3.7 for AI-generated images alone vs 4.7 for human-collaborated). AI products that win let the user keep a meaningful step.
  • Working metaphor: 'AI is my intern, not my art director.' Delegation of judgment to AI is the failure mode; delegation of grunt work is the unlock.
  • Frames user adoption around loss aversion: average non-tech users are hesitant on AI unless it is sold as time-saved, not capability-gained — and most products still don't communicate failure states like humans (Canva Sheets replaced spreadsheet '#ERROR' with plain-language suggested fixes; competitors haven't).
  • Core thesis — 'design outside the dataset': AI builds the happy path; humans break it. Voice assistants fail on road noise, crying babies, contextual switching — environments LLMs were never trained on. Complex interactions (Canva columns/layouts with Simon Lynn) still demand on-the-ground prototype testing with real users.
  • Design-as-moat argument echoing Josh Newton from earlier in the day: AI speeds up the loop, it does not replace craft or judgment. Dyson's 5,127 vacuum prototypes and Apple's keyboard team are the canonical cases — design is the driving innovative competitive advantage and the only durable lever above a commoditizing model layer.
  • Operator move he advocates: build internal code templates of your product surface so non-designers can drop into Cursor/Claude during workshops and prototype ideas — designers should empower, not gatekeep.
  • Closing 'hot takes': (1) mute AI-design slop on social; (2) talk to people outside the tech bubble — users don't care about your features, they care about their needs; (3) leaders must give teams space to adapt rather than bullying designers over screen-count KPIs; (4) be AI-fluent, not anti-AI — but be anti-bullshit. Cites Christina Koch / Artemis: 'find your crew.'
  • Quotable line he flagged: SpiceJet CTO — 'AI is very expensive right now, but the overhead of hiring people is cheaper' — inverting the usual narrative of AI as the cheap labor substitute.

Slides (20)

Slide notes

  • [t=6:57:36 / start] Slide 1: 'In 1947 General Mills cake mixes initially required almost nothing from the baker; they asked customers just to add water… Sales were poor.' / '2 years later, they let their customers add a fresh egg. This made bakers feel invested. The real insight was adding a step back in, not removing one, and ensuring people felt like they had control.' [t≈6:58:00] Slide 2: 'People value AI outputs more when they help create them' — The Ikea Effect. Bar chart: 3.7 (AI generated image — no collaboration) vs 4.7 (Human collaborated with AI to make an image). Source: 'The IKEA effect in human-AI collaboration.' Lower banner shows speaker ID: 'Jay Demetillo / Principal Designer, Ex-Canva / Prompts Don't Have Opinions. You Do.' [t≈6:58:45 → 7:01:00] Slides 3–5: speaker-cam, projector glitch — Jay continues talking through the cake-mix → IKEA-effect → 'AI as intern not art director' arc and the loss-aversion framing for normie adoption; no new slide content. [t≈7:01:30] Slide 6: a Figma/freelance-design surface visible behind the speaker showing 'Sample Works — Logo and Brand Kit for Washimi Laundry / Marketing Collaterals for Pourover PH / Web Design for Sagip Dalisay Foundation' with PHP pricing (Logo PHP 5,000, Brand Kit PHP 2,500) — appears to be a tangential reference example, not a main slide. [t≈7:02:00 → 7:08:00] Slides 7–18: speaker-cam frames during second projector outage (Jay says 'classic' and 'we got this'); content delivered verbally — Canva columns/layouts case study, Simon Lynn shoutout, 'design outside the dataset' framing, Dyson 5,127 prototypes, Apple keyboard, Huawei smart-home voice context, SpiceJet CTO 'AI expensive, headcount cheaper' quote, four closing hot takes ('mute the slop', 'talk to people outside your bubble', 'users don't care if your product is better — design for needs', 'give your team space to adapt'), Christina Koch 'find your crew' close, 'be AI-fluent, anti-bullshit' kicker. [t≈7:08:30] Slide 19: closing AI Engineer Singapore wordmark over the SG skyline outline — Jay yields to Usman (13-year-old co-MC; built a Muslim prayer-tracking Quran app on Google AI Studio; 'vibe coding is not that hard, you just need to put in the hours') who introduces Alex Lee (next speaker, on AI design systems). [t≈7:09:30] Slide 20: stage clears; Alex Lee walks on for seg 29.

Full transcript (2074 words)

&gt;&gt; I'm Jay. I used to be at Canva. I used to work at Grab. How you doing? Today, I'll be talking about prompts don't have opinions. You do. So, for context, right, I'm really tired of the that's being put out there. Prompting ain't is but slapping tricks. So, I have my notes on my phone. If I'm looking at my phone, it's not that I'm looking at an agent, but I'm looking at my notes.

So, I'm tired of these design influencers, these leaders, these people that have positions and high power that talk about the design process, yet they haven't done anything or shipped to millions of users. Like Jon Snow, they know nothing. So, take it from me and the people here who have actually built something for millions of people. And let's talk about this.

Oops. Oops. So, historically, right, uh I think there's a parallel between AI and actual products that have been built. So General Mills, a baking company in the states, in 1947, they released a cake mix and people weren't really vibing with that in general. And when they added an extra step, just adding an egg to that instant mix, people were invested.

They felt like they were creating, which is so interesting, right? Because it's the same with AI. I think for for anybody that's designed for AI in general, people create value when AI outputs it and helps them, right? And it's it's called the IKEA effect. People are going to be invested when AI is actually collaborating and acting as a partner.

Uh, and it's it's fascinating with products that you've seen out there, whether it's Canva, Google, Figma in general. You let people have the choice to either use AI or edit or generate with you. Some people, you know, obviously a little bit hesitant. You see it with them not trying to use tokens, but it's fascinating, right?

Oh, it's gone out. Is it because All right. Awesome. Amazing. Cool. We're back. We're back. I hope. Keep it active. Oh, we're down. &gt;&gt; When something like this happens, just give him a huge round of applause to not make it awkward. There we go. &gt;&gt; It happens. It happens. No worries. So, I'm going to keep going.

We're good. Awesome. So, how do I work with AI? And how do most people work with AI? For me as a designer, I use it as my intern, not my art director. So, you've probably seen this video on LinkedIn. It's fascinating, right? Uh people are token maxing and using all their these their tokens. It's sloppy. Um and it's the same with cloud code uh in general, right?

You use up all your tokens, you get upset, you're like, "Ah, dang." Like, I you you lose all my credits. It's expensive. Why would I want to build? So, I guess I would ask most people, right? Would you delegate decision-m to a human expert or AI? I guess when you're designing for real people and you know I encourage people to talk to people outside of the tech bubble that you're in because average people are hesitant to use AI.

So if you frame it in a way where people are actually spending less time to think about things then people are more inclined to potentially use AI in general which is fascinating right and they did a study on this. People are more inclined to use AI uh when you frame it as a loss of time uh and speed in general and we we did this right so for me I built canvas sheets uh AI powered spreadsheet and it's fascinating to me when there are other competitors out there that still use this hashtag error and it doesn't really communicate what's actually wrong uh and if you talk to normal people that use spreadsheets on a bas daily basis they're overwhelmed by this right so at Canva we try to make it easy for someone to use formula and we give them and talk to them as a human being to give them a suggested fix which is fascinating in itself right and it's the same with the voice uh assistant stuff that I've done as experiments as well AI builds the happy path uh as you've seen uh humans break it they don't care right and you can't prompt for environments whether you're outside dealing with road noise babies crying right and it's unfortunate because I think a lot of people if you've built for voice it's so expensive to go on the wrong path. Uh, and if you've ever done it before, it's just hard to do in general. So AI can't solve everything.

So I encourage you to think outside of the data set, right? I think this graph you've probably seen a lot. So when you're designing, if you're an entrepreneur or even designer or creative or dev, think about what is the driving innovative competitor advantage that you have and who drives that? Design. Design has always driven this value to have an advantage on competitors in general.

So James Dyson is also a good example of this, right? He's prototyped 5,000 and 100 vacuum prototypes if you read about his story and he didn't get a call until one person took uh a chance on him in general, right? The same with Apple keyboard. A lot of people hate it, but you have to remember they had to think about the smart shortcuts, the things that people would say, different countries, different words come up as well.

And the team, I'm sure, went to Steve Jobs and iterative uh continuously on this just to get to the spot where it is now, right? And they have to consider, right, who they're designing for the world, these new add-on additions as well. So designed outside the data set, right? So I' I feel like and it's related to what Josh Newton talked about earlier, AI speeds up the loop.

It doesn't replace design craft or judgment in general. So for me, right, I I teamed up with a designer at Canva. Oops. Oh no. Again, &gt;&gt; classic. All right, cool. Hey, hey, hey, chill, chill, chill. We got this. So when I was at Canva, I designed uh columns and layouts. Shout out to my boy Simon Lynn in Taiwan who's a legend who also helped with this.

And these are complicated interactions, right? Not everyone's going to get this. So when we went to ground talk to real users with real prototypes, we had to think outside of the data set. So AI can't solve complex interactions, complex products, you still need to talk to real people and actually test with things that AI can't potentially generate or think about.

So it's the same when we're working in workshops as well. Uh we actually build um code templates. What does that mean? Well, we build code templates of our products and it helps people get into cursor, get into claude and actually build ideas during workshops, brainstorming and empowers everyone, right?

We as designers should not gatekeep in general. We should empower everyone to bring their ideas, build their ideas through AI so we can prompt and actually test this on the ground which is really important. So, it's the same with smart homes and and voice as well. Uh, it's very fascinating to see where Huawei is going with the future as they look into smart homes.

Think about voice and being contextual because AI again can't be reactive. It has to learn. It has to be trained. So, how do you think about this and have a smart system that adapts to normal people's behavior? So, lastly, build the world that you want to actually live in, designers and devs and folks that are here, especially entrepreneurs, right?

Because people are investing in experiences and design is going to be uh that lever that pulls things forward, right? And CTO at SpiceJet uh India, he even talks about it. AI is very expensive right now, but the overhead of of hiring people is cheaper, which is a fascinating quote, especially in the age of AI. So, final hot thoughts and hot takes before I end today.

Mute the trash on on AI design on social media because there's a lot of out there to be honest. Talk to people outside of your networks and bubbles because the average person is actually pretty scared of AI right now and of course it's out. It's fine. Users don't care if your product is better, right? They don't care if you they have a cool feature that's better than your competitor.

You need to actually design for these people and their needs and be contextual. And finally, to the design leaders out here in this region and to the world, I think you have to give space and time for people to actually adapt AI. I've heard too many stories about designers that have actually uh been called out for not designing enough screens, uh being bullied by these poor design leadership because they don't know how to use AI, right? I got told that my work doesn't matter, but guess what?

I I designed a product that went to millions, so I don't know what they're talking about. So to be honest, right, I think it's important to empower your team. So last last point that's not up on here. Christina Caul, she went out to Artemis uh spaceship obviously around the moon. She talks about finding your crew.

So I encourage you to find your crew, your networks here. Feel empowered, feel connected to the network that you're adapting and working with in AI because it is important because in the world that we want to be in, you don't want to be anti-A. You need to be AI fluent. Just be anti-bullshit. Thank you. &gt;&gt; Give it up.

Be anti-bullshit. How many of you are anti-bullshit? I tell you what, I am. Wow. Really? You like The rest of you, huh? Anyway, um please give it up. We have We have a co-m. Check this out. It's Usman, everybody. That's right. Usman, less than half my age. Um I won't tell you what that is. And he's so active in the local community here.

Building. What's the last thing you built, bro? &gt;&gt; The last thing I built was like a, you know, religious app, right? &gt;&gt; Like how you built one. &gt;&gt; Yeah. &gt;&gt; Um, mine is for Muslims all around the world. And how you can uh track your prayers and like all of the different suras in the Quran, which is our uh holy book. Yeah. &gt;&gt; Similar to your Bible. &gt;&gt; Dude, that's so cool.

And you built this? &gt;&gt; Uh, yeah. With Google AI Studio. &gt;&gt; Let's go with Google AI Studio. Give it up. Like a builder. How old are you? &gt;&gt; I'm 13. &gt;&gt; He's 13. What? That's the future. How One last question while they set up. How what's the experience like building with AI Studio? Like are you just prompting things?

Are you writing code? Like what's the &gt;&gt; Well, of course, in the beginning, right, I couldn't like vibe code at all. It took me probably a year or two to really figure things out. Yeah. And it's I've come to a conclusion that vibe coding is not that hard. You just need to put in the hours. &gt;&gt; That's right.

You just need to put in the hours. Fantastic. So, you're introducing the next speaker. Is that right? &gt;&gt; Yeah. &gt;&gt; Let's do it. Give it up everybody. &gt;&gt; All right. So now we have Alex Lee who has come all the way from San Francisco to Singapore and he's come to introduce how AI needs design systems.

Currently the users like the AI studio and all that stuff. The designs are horrible. I'm going to be honest right now. We want designs that actually match the users brands. Give it up for Alex Lee. &gt;&gt; Oh, you need the mic. Sorry, guys. It's like, how is he supposed to do the talk without the mic? Alex, one more time everybody for Alex Lee. &gt;&gt; Thank you.

Thank you everyone. &gt;&gt; Okay, perfect. The slides are here. Um,

AI-native design systems, brand fidelity, and shippable UI generation

Alex Lee, founding engineer at Magic Patterns (AI design tool taking teams from idea to production in minutes; used by 2,000+ product teams including KPMG, Ramp)

Summary

  • Thesis: in the AI era, building features is trivial — consistency and brand fidelity are the hard problem. Design systems become more critical, not less, because they are the guardrails agents need to not hallucinate components and colors.
  • Brief history arc: pre-design-system web = MySpace chaos -> Brad Frost atomic design (tokens; atoms/molecules/organisms/templates) brought consistency, speed, scale -> over-enforcement and bureaucracy -> framework-not-rules era with creative guardrails.
  • Cost-of-implementation collapse reframes the questions: not 'can we build this / how long' but 'do we want this, does it use my design system, does it align to brand, do we want to maintain it' — Opus 4.7 tokens make code basically free.
  • Magic Patterns' AI-native design system primitive: two pillars are documentation AND code (not just Figma mocks, Storybook, or a design.md). System-level rules (tokens: color, typography, spacing) + component-level rules backed by actual code so the agent reads props, variants, and usage directly.
  • Reference customer Headway (mental health / therapist matching platform): Storybook ingested as docs (system + component rules via stories); code ingested as npm module or GitHub-synced for tokens and components. Same generic 'build me a dashboard' prompt produces a generic SaaS dashboard without the design system, but a brand-matched, ship-ready dashboard with it.
  • Design-to-engineering handoff rewritten: no more Figma-mock -> manual token reconciliation. Code-backed prototypes flow bidirectionally via MCP into Claude Code / Cursor / Codex — prototype-to-feature and feature-to-prototype round-trips, both using the same design system components, both high-fidelity.
  • Closing line: 'AI alone will not replace craft. Design systems used to help us build with craft; today they help our agents understand what craft looks like.' Craft = context + intention + humanness.
  • Notable handoff moment: emcee teed up Magic Patterns vs MagicPath rivalry in the room — Sabina Cabrera from MagicPath spoke immediately after, making this segment the on-stage opening of the design-tool wars.

Slides (8)

Slide notes

  • Note: all 8 extracted frames at this segment are speaker-cam / podium / handoff shots — no content slides from Alex's deck were captured in the extraction pass. Alex did reference visuals in-talk (history-of-design-systems graphic; AI-native design system diagram with documentation+code pillars; Headway dashboard with-vs-without-design-system comparison; design-to-engineering handoff diagram with MCP round-trip) but they are not in the JPG set. [t=7:09:54 / 25794s] Segment opens with Alex on stage; [t=7:19:34 / 26374s, slide_08] only content frame is the handoff title slide for the NEXT speaker: 'MagicPath! — Sabina Cabrera for AIE Singapore' — visually marking the Magic Patterns / MagicPath transition the emcee joked about ('I did sort of tell these guys that they exist and they'll go after each other'). Slides 01-07 show: Alex finishing on stage with emcee, Sabina coming up, emcee bridging the two — useful as a visual marker of the design-tool wars block on Day 2 but not as deck content.

Full transcript (1688 words)

Yeah, I'm Alex, one of the founding engineers at Magic Patterns. Actually, just want to get a quick poll. Has anyone actually heard of Magic Patterns before? Raise your hand. Oh, actually there's a couple of you. Super cool. For those who don't know us, Magic Patterns is an AI design tool that gets you from idea to production in a matter of minutes.

We've been used by over 2,000 product teams, KPNG, RAMP, etc. But I mainly work on design systems. And so, you know, in the world of AI, it's been much easier to build new features and new functionalities, but the hard thing we still have is consistency. And so, I'm here to tell you why design systems have not only been needed even more, but are crucial in the AI world today.

And so before I begin, let's talk about some history about why design systems are even needed in the first place. So before everything, the world or the web was the the wild west. Every page was different. It looked like your MySpace page with different widgets everywhere, different buttons. Designers had to reimplement, engineers had to reimplement, and there wasn't really any shared systems in place.

And so to restructure this chaos, we have design systems. It's a shared language that your product teams can use. You have your tokens which represent your colors, typography, your spacing. And thanks to Brad Frost's atomic design, we have a great hierarchy and nomenclature for components. We have your atoms, the buttons, labels, inputs.

We have the molecules composed of those atoms, maybe your form modules or your search bar. And then we have the organism level components and templates to create larger things like your sidebar or your dashboard layout. And so the promise was simple. We had consistency, speed, and scale with the thanks to design systems.

But maybe things were a little too consistent. Maybe, you know, instead of it took a lot longer to add a new button into the design system. There's bureaucracy now. You have to ask the team, can I add this new thing in into this layout? And so we weren't thinking about things from first principles. It was not about what how do we solve the user's problem from the ground up, but it was more how do we use the tools in our design system or components in our tool shed to solve that problem.

And that rigidity was not that helpful. And so the industry took a step back. Design systems can be a little too enforcing. And so let's think of things more as a framework rather than a set of rules instead. This way you can have that creativity but still have those guard rails in place to get your consistency in your brand whether it's your typography, colors, logo, images and set like that.

And so finally we were at peace. We have a way to build creatively but also with guard rails in place and nothing disruptive ever affected the tech world again. Right, man. I feel like in even the last six months my workflow has completely changed. I'm sure for every one of you I don't even write code anymore.

I just asked the agent to write it for me. I'm sure for design product management everything has changed as well. And I think it's interesting, right? The cost of implementation has now become basically free, especially if your company is already paying for those opus 4.7 tokens, right? And so the questions change from can we build this?

How long does it take to build to, you know, do we even want this? Do we need to m do we have something we want to add? Do we want to maintain this? Does this new feature use the components in my design system? Is this new feature align to my brand? And so with that, we have all this chaos that AI has created for us.

And we go back to why design systems were created in the first place. And specifically, we need those guard rails. And so this is what the AIS, you know, in the world of AI without that context, you specifically have things that are not necessarily to your brand, right? Things get hallucinated. You might have components that are hallucinated.

You might have colors that are on to your brand guidelines. And overall, you really need those foundations and context to make things work. It's not just your Figma mocks. It's not just your story book, not even like a design MD. We really need context to align our agents to build things that align to our brand.

And so we came up with a solution on our end, which we call the AI native design system. Obviously, there's not that much difference to a normal design system, but the key things are now that we have two pillars that this design system relies on. Your documentation and your code. You have your system level rules, your tokens, like I mentioned, your color, typography, spacing, and then your components, but specifically backed with code because the more code aligned your design system is, the more close it is to what your users are actually seeing.

And this also allows the agent to understand the props, the variance, the way in which to use those components directly. And so what is a realworld example of this look like? Here's one of our customers, Headway. Headway is a mental health platform that helps people find licensed therapists and they already had a design system and so we helped synced it for them.

We took their documentation and their code and created it the same structure that I mentioned before. Storybook as a source of documentation lends itself to system level rules and component level rules based on stories. And then their actual code ingested either as an MPM module or synced with GitHub for tokens and like I mentioned those components. And it's crazy because I can't show this in a live demo because it might take too much time, but the differences are stark.

I generated these ahead of time, but with the same generic prompt of build me a dashboard, you get something completely different. Without a design system, you get something that works with your UI, right? Or it's a nice generic SAS dashboard, but does not fit with maybe your brand or your product. Same prompt with that design system context.

This matches really closely to what Headway's brand looks like, right? We have our logo. We have our components, colors, typography, all matching in place. And now we're actually able to ship really close, highfidelity code, even with more simple prompts. And now this completely also changes what that design to engineering handoff looks like.

Right? In the old world, I had this Figma mock. As an engineer, I would have to look at it and check my story book, see which components that align to, make sure the color tokens are correct, right? And it was very hard and I had to build everything from scratch. But now we're not even working with designs anymore.

We're working with codebacked prototypes. And because these prototypes are using my actual design system components, I can hook it up by an MCP to something like cloud code cursor codeex and just be like, oh prototype tool, design tool, give me this design, make a new feature out of it. And those same underlying fundamentals, both code bases should be using my same design system components and I should be able to get something at a much higher fidelity. But because those prototypes are also codebacked, I can do it the other way around.

I might have a feature that's not necessarily in mocks yet or in the world of Vibe coding, people are always producing new features. And what I can do now is just say take this piece of code, take this page and convert it into a prototype that I can easily iterate on. And now because of this MCP round trip, I now have high fidelity transfer on both directions. And so as agents evolve, so will our workflows.

But I think the one really hard thing we've not been able to match yet is craft. AI alone will not replace craft because without context, you're not going to have the intention, the touch, kind of the humanness that makes great products the way they are today. But design systems are here to add that context. And so in the past, design systems used to help us build with craft, but today they help our agents understand what craft looks like.

So, I hope this helps understand why design systems have become ever so important in this AI world. Thank you. &gt;&gt; Thank you so much, Alex. And coming all the way from the US, uh the next speaker Woo. Yeah. Um cool. Uh next speaker is going to be uh Sabina from Magic Path, not Magic Patterns. Um, I did sort of tell these guys that, you know, they they exist and they'll go after each other, but I thought they'd be kind of fun.

But, um, yeah. So, I thought it'd be fun to tell a little story as well about Sabina. Um, she actually studied chemistry, I believe. Is that correct? &gt;&gt; Yeah. But now she's a design. &gt;&gt; Is that like Breaking Bad? &gt;&gt; Like Breaking Bad? &gt;&gt; Like Walter White? &gt;&gt; This is Singapore.

We can't say stuff like that. &gt;&gt; Sorry. &gt;&gt; It's okay. &gt;&gt; But anyways, um, but that's cool. I think again um you can kind of study anything and then become anything. And what did you study? &gt;&gt; Nothing. I studied nothing. I have zero degrees. I I'm just I'm uneducated. &gt;&gt; Yeah.

So sometimes guys, you can just do things. No one's stopping you. Just if you're chem if you're a chemistry person, you can design. Um so that is a little quick background about Sabina.

Balancing product and polish in a code-backed design workflow (MagicPath 2.0)

Sabina Cabrera, Design Engineer at MagicPath (NYC-based AI design tool; canvas-first, multi-agent, hooks into Codex / Cursor / Claude Code; MagicPath 2.0 shipped two days before the talk and introduces multi-agent enterprise collaboration)

Summary

  • Title evolution itself was the lede: talk was submitted in March as 'Should designers [insert May 2026 design trends here]?', collapsed through 'Should designers code? Should designers design?' and landed at 'Should engineers design? Yes' — a deliberate pivot to the AIE engineer audience and a tell on how fast the design-tool space has moved (skills/agents weren't even dominant when she submitted).
  • Engineering-friendly design crash course in ~90 seconds: sans-serif (Inter — the developer-site default; play with tracking/kerning) for clean/approachable; mono (JetBrains Mono / IBM Plex Mono) for technical 'numbers feel'; serif (Times-style) for authoritative/professional (origin: actual Roman-era brush stencil marks); shaders/WebGL via unicorn.studio with Maxim Bondarenko's blog (Linear engineer) for the math; hero-page animations = clone the codebase, branch, prompt 'make it pop'; the 'right-click -> Inspect element -> Computed tab -> paste into MagicPath' move as an anti-gatekeeping primitive for both designers and engineers.
  • Thesis (the row title): designers don't iterate-reflect anymore — they refresh-the-page until 'good enough', stuck in single-viewport tunnel vision. MagicPath's positioning is a canvas-based design tool that surfaces past iterations side-by-side so you can actually see and choose, vs the back-button shuffle other tools force. Quote: 'You will only get better at something if you see your past work — a digital canvas helps with this.'
  • Format-agnostic philosophy: 'I don't know how the hell you guys are designing. It really doesn't matter.' MagicPath is positioned to meet teams wherever the design lives — half-baked Next.js app, Figma file, teammate's head — and not force a native file format. Explicit anti-lock-in framing aimed at Figma / Magic Patterns / Lica.
  • Skills/SKILL.md aside that lands on craft: don't just yolo a SKILL.md you found on Twitter into your chatbot — 'do I want every single thing? No.' Same as taking an unreviewed PR: dealing with someone else's AI crap doesn't spark joy. The polish half of the product-vs-polish balance is curation discipline at the context layer.
  • MagicPath 2.0 announcement (shipped 2 days before talk, May 15 2026): (1) canvas-first single-user design, (2) multi-agent — side chat bar + multiple agents on the same canvas, (3) enterprise multi-player — marketer, CEO, designer in the same workspace ('too many cooks in the kitchen, see what comes out'). Co-built with Chloe Park.
  • External-agent interop is the durable hook: MagicPath plugs into Codex, Cursor, Claude Code, and the Amazon IDE. Pitch is 'don't burn more credits — use up your existing Pro subscription on MagicPath.' Founder Pietro Schirano posted a viral demo (3M+ views referenced on slide_10) showing MagicPath running as a native canvas inside Codex. Closes the loop in both directions: design surface -> codebase -> edits to local file -> back into MagicPath as a clean file with live shareable links.
  • 'Design from everywhere' soft pitch: hook MagicPath up to Telegram / WhatsApp so designs can bake async while you do other things (her explicit anti-pattern is designing on a phone — too much cognitive overload).
  • Closing framing — 'Nothing great is built alone, part two': team is NYC-based downtown Manhattan, design system hosting is the most-asked feature and is supported, QR code unlocks 1,000 free credits on top of the signup default. Parting koan, deliberately broken: 'If you want to go fast, go alone — if you want to go far, use MagicPath, then with your team, then with your enterprise multi-agents.'
  • Handoff context: this is the second half of the Magic Patterns / MagicPath back-to-back the emcee teed up at the end of seg_29 ('I did sort of tell these guys they exist'). The room got both sides of the design-tool wars in roughly 30 minutes.

Slides (18)

Slide notes

  • Segment runs from t=26402s (7:20:02) to t=27241s (7:34:01), 14.0 minutes / 2898 words / 18 extracted frames. Several frames are pure speaker-cam / handoff shots and are flagged below. [t~7:20:02 / slide 01] Speaker-cam: Sabina on stage, mic in hand, white blazer + dark green jacket layered over, jeans. No slide content visible — segment opener as she walks the audience through her NYC origin and corrects 'MagicPath not Magic Patterns'. Drop as deck content. [t~7:21:00 / slide 02] TITLE SLIDE — 'Should designers [INSERT MAY 2026 DESIGN TREND HERE]?' / 'Should designers code?' / 'Should designers design?' all struck through. Footer caption: 'Evolution of an AIE talk title'. Lower-third: 'Sabina Cabrera / Design Engineer, Magicpath / Should designers design? Balancing product vs polish'. This is the title-as-thesis frame and the canonical screenshot for the post. [t~7:22:00 / slide 03] CONTENT — sans-serif explainer with Modal Labs marketing site ('AI infrastructure that developers love') as the live example. Footer caption: 'Sans Serif fonts are considered clean, approachable, easily read'. Anti-gatekeeping pitch to engineers: this is the Inter / developer-site-default lesson. [t~7:23:00 / slide 04] CONTENT — terminal/git UI mock: 'main' branch + commit hash '61da504 Add lib/utils and deps ne…'. Footer caption: 'Mono fonts feel more technical, computer-y, true'. The JetBrains-Mono / IBM-Plex-Mono explainer. [t~7:24:00 / slide 05] Speaker-cam over a MagicPath UI screenshot (canvas + sidebar + prompt input visible top-left). Footer caption: 'If you like these websites…'. Setup for the 'right-click inspect, copy computed layout, paste into MagicPath' move. [t~7:25:00 / slide 06] CONTENT — three-pane composite: a live Modal-style marketing block with shader gradient, a Chrome DevTools HTML inspector, and a Computed-tab font/layout panel showing 'font-family: Inter Variable, ui-sans-serif, system…', 'font-size: 14px', 'cv11', '772px'. Title: 'Right click -> inspect element -> dig around in Computed tab'. Footer caption: 'How to figure out what you like on a website — div blocks are just layers!' This is the practical anti-gatekeeping primitive — the most operator-actionable slide. [t~7:26:00 / slide 07] CONTENT — text-only thesis slide: 'You will only get better at something if you see your past work / A digital canvas helps with this'. Footer caption: 'Canvas-based design tooling'. This is the title-track thesis line — pair with the title slide for the post. [t~7:28:00 / slide 08] CONTENT — text-only parting setup: 'If you want to go fast, go alone'. Footer caption: 'Parting words'. Setup for the broken-koan close ('…use MagicPath, with your team, with your enterprise multi-agents'). MagicPath 2.0 announcement lands verbally over this slide. [t~7:29:30 / slide 09] CONTENT — title 'Make anything from anywhere with external agents' over a terminal screenshot showing 'Claude Code v0.X.X' welcome panel. Footer caption: 'Use Codex, Cursor, Claude Code to design in MagicPath'. This is the IDE-interop slide — load-bearing for the harness-as-moat read. [t~7:30:30 / slide 10] CONTENT — same 'Make anything from anywhere with external agents' title now showing a Pietro Schirano tweet: 'You can now run MagicPath as a native canvas inside Codex to design and build functional apps. It's pretty incredible. Here's how to do it' — engagement metrics visible (357.6K views, posted 3:08 AM May 16 2026). Embedded video thumbnail of MagicPath canvas inside Codex. Same caption as slide_09. This is the viral-demo receipt for the post. [t~7:31:30 / slide 11] CONTENT — 'Thank you! In this QR code:' bullets — (1) link to talk slides + recommended readings + inspo, (2) email + twitter, (3) high-level features of MagicPath 2.0, (4) onboard your company + design system hosting, (5) MORE CREDITS — 1,000 free on signup. QR code right-aligned. Footer caption: 'MagicPath 2.0 is here!'. [slide 12, 13] Duplicate / near-duplicate captures of the QR thank-you slide as Sabina holds the mic during Q&A intro. Same content as slide_11 — collapse with slide_11 in any deck-extract. [t~7:33:00 / slide 14] Speaker-cam: emcee (cap, dark jacket) plus a younger emcee-in-training on stage, large 'A' wordmark stage-right, Singapore skyline backdrop. The 'we're training him' bit — pure handoff frame, drop as deck content. [slide 15] Speaker-cam: emcee-in-training solo on mic (the bit where he forgets the next speaker's name). Drop. [t~7:33:30 / slide 16] Speaker-cam: next speaker Priya at the podium under low light, getting set up. Drop as Sabina deck content — this is segment 31 territory. [slide 17] Speaker-cam of the emcee-in-training again, mid-handoff. Drop. [t~7:34:00 / slide 18] CONTENT — next speaker Priya's title slide projected behind the stage: 'Training AI to become your design partner / Priya [name + role]', browser-chrome chrome around the slide (Arc / Chromium tab bar visible). Captured at the very tail of seg_30 — technically belongs to the next segment but useful as the visual marker that the design-tool block continues into seg_31. Slides 01, 12, 13, 14, 15, 16, 17 are speaker-cam / handoff / duplicate frames.

Full transcript (2898 words)

Hello, my name is Sabina. I came all the way from New York City just to talk to you guys. I'm so excited to be here. And I am a designer at Magic Path. Not patterns path. Light mode, dark mode or light mode. Um, so it's funny. I actually hosted a workshop. If any of you guys were there two days ago, hello again.

Um, and I completely redid my talk this morning because I realized, oh my god, I'm talking to like capital E engineers. So this is for you. Um, if you saw on the uh schedule, my talk was should designers insert May 2026 design trends here. Uh, and that was penned in March because I was like, Sherry, this space is moving so fast like god knows what, right?

Like I don't even think skills were prevalent before uh I I submitted this talk. So, um, that evolved. I didn't do that. Should designers code? Should fish swim? That didn't work. Should designers design? This is actually a good point. I will uh come back to it. But I think if you're a designer right now, who's also touching code against their will?

Yeah. Okay. And then I realized, wait, I'm not talking to the right crowd. Should engineers design? Yes. And so this talk is going to be for you guys nerds. Um so, uh if you for me engineering is really scary because div blocks are scary, but if you think of div blocks, it's flex flexbox. And if you could go flexbox, it's auto layout.

So in like 90 seconds, I'm going to teach you everything you need to know to take my job. I hope you take my job, right? I'm tired. So if you see a font that looks like this and you're like, that's very clean, very easy to read, very human, right? Um I prompted that this morning. Like this is called uh sand serif.

Uh it's very approachable, very human. You probably see on every developer site modal, you know, linear claw. They have their own thing. They're expensive, but inter is a very good reliable thing. And people usually just play around with the tracking and the kerning. You know, if you ever see that A versus A, that's just like an expect element.

You can change it out, right? If you see this font, you're like, "Wow, I'm technical now. I'm seeing numbers. I'm seeing something that's very scientific." This is called a mono font. Blank mono guys to mono is probably what you need to know. It's very like, "Oh my gosh, if I go on my website, like tech, right?

That's awesome." If you see this font and the difference, you know, attention is all you need. Latte is in it. Um, Times Roman, anything that's kind of serious, uh, anthropic answering my question of should I drink five shots of tequila before this. Very authoritative, very professional. This is called SIF.

And if you want to know in 3 seconds why we have a difference, SIF is when uh, back in the like Roman or Greek ages, uh, people would draw like kind of what they were going to like stencil out. And these little marks are from actually like the paint brushes of people drawing. So that's literally where it's from. Now you know.

Okay. If you see something like this, shaders, interactive things. If you see, wow, like how the hell does that happen? I don't know WebGL. Um, yeah, this is shaders. All you need to know is that uh you can go to unicorn.studio, get that done. If you want to actually know the math behind it, go to my friend Maxim's blog.

He works at linear. He's fantastic. Um, and that's everything you need to know. So, um, let's see what else. No gatekeeping here. You're like, "Wow, I'm on a hero page. Here's Magic Path's website, which you will all see soon. Here's Cursor's websites. How on earth do they do these hero animations?" Guess what, buddy?

Yeah, that's right. You just take the codebase, you throw in an animation thing, and you make a new branch and you say, "Hey, make it awesome. Make it pop." Um, usually people have a recording of their product on here, but I advocate for this because, uh, you kind of want to speed things up. You know, there's like kind of a etiquette when it comes to making people wait through your AI generated whatever.

Uh, and it's just faster and you can do a lot of really cool things. Like if you see my prompts, I'm just like, make it pop, make it bigger, like make it in 10 seconds, whatever. Okay. Also, I'm not gatekeeping designers, too. This is for you. If you ever see something on a website and you're like, "How the hell did I do that?" You rightclick it, you go to inspect element, and you dig around until you find the computed layout, and you copy that into um magic path, which you'll see soon.

And yeah, this is um this is all to say that I think it's really interesting. Engineers have taste, right? I writing good code requires some sort of like finessing. And I think design has been such a blackbox for engineers that they don't realize like no, you can have taste with this kind of stuff, too.

Like everything you just saw, like that's 2026 designers in a nutshell. Like Um, I didn't go over instrument sands, but okay. So, something I want to kind of segue into is how are we kind of defining design and work today? Design today, there's a lot of um there's a weird uh pattern that we've kind of behavior that we've encouraged of you iterate, refresh the page, iterate, refresh the page.

You're kind of stuck in the single viewport and if you want to see a version you kind of have to like do this awkward dance of like pressing the back button or whatever and like you don't really think you kind of like iterate until it's like good enough but you don't really pause and reflect and think about oh wow like maybe there's something good from this iteration versus this iteration right you're just kind of moving forward uh and not being introspective which apparently is uh not masculine so um given how hard it is to predict the oh what does that say uh the future of design like I work you know at magic path and I see a lot of design tools that are like oh um you got to export it as this whatever file like oh you have to natively make it in there. My thesis is like I don't know how the hell you guys are designing. I really it doesn't matter. Um I want to be able to give you guys the best tool possible to kind of meet you wherever you're at.

Whether your design is like in a halfbaked next.js JS app, if it's in a Figma file, if it's in like your head, if it's in your teammates's head, it doesn't matter because um yeah, I mean, creativity comes from anywhere and I don't want to be the person to tell you where it comes from. So, I was talking to Sher. She actually invited me to this talk back in March and I was like, "Hey, like uh I don't know what I'm g I don't know what I'm going to give a talk on." And literally, this is what I told her. Like, I made these slides the day of.

So, it's not out of laziness, it's out of accuracy. So, yeah. Okay. This is a quote I think everyone should remember. I think this is kind of like the whole thesis of this uh conference. Uh I'm just going to read it out loud. John Collison, who is like one of the Collison brothers, part of Stripe, he says, "As you become an adult, you realize that things around you weren't just always there.

People made them happen. But only recently have I started to internalize how much tenacity everything requires. That hotel, that park, that railway, the world is a museum of passion projects." And I say this to say, you know, uh, you know, some people just throw skillmd files and they're like, you know, put the fries in the bag, whatever. But I think there's a beauty of like kind of understanding like, wait, before I just like park this skillmd file I found on Twitter in my chatbot.

What's in it? Like, do I want every single thing? Like, do I even like yes, it's Airbnb's design system, but do I want every single thing? No. You kind of want to finesse things, right? It's it's similar to like whenever like someone gives you like a PR that was obviously not looked at like they can't explain every single line of code.

Not that they have to, but you know like dealing with someone else's AI crap doesn't spark joy. I think everyone can agree on that. Let's see. Okay. Um this is all to say like I'm saying this all from the heart and um you know as a designer before this I did a AI design startup where I tried teaching people design.

Uh so you know there is no corporate shilling hat on here. But now there is wait. Damn I wish I was smoother. If you want to go fast go alone. If you want to go far you should use magic path. Then you should use it with your team. Use it with your enterprise multi- aents. We just released it two days ago. So there's me Chloe Park.

If any of you guys know her, she's fantastic. Um, so you can not only design in Magic Path, one on a canvas, which I think is the right way to go, two, with multiple agents, whether it's the side chat bar, um, and three with your actual enterprise team. So, get the marketer, get the CEO in, like put too many cooks in the kitchen, see what comes out. You know what I mean?

Um, the cool thing is that, you know, I have been seeing all the love for cursor and all the love for codeex going on here. I'm such in awe. And the great thing is that you can actually use magic path with your existing tools. So I give a workshop I use cloud code but you could use codeex you can use whatever like I think I saw someone with like the Amazon IDE was that Kimmy or uh anyway you can hook up Magic Path to these different agents say hey like you know if you have like a bunch of you know pro subscription credits like use that up on Magic Path.

Don't feel like you have to buy more credits. Like again we're trying to meet you where you're at. Um my boss Pietro who is such a trip if any of you guys know Pro he's like such a crazy guy but he made this really awesome video and like where he just shows using codeex you can make these really amazing designs and I think like this next generation of design is just going to be about you know we have the tech we have to communicate to people that no this is how you can actually achieve like engineers designing and designers learning how to work better with engineers um so we have all tech it's just like being able to you got to put it in people's face and be like, "Hey, hey, you know, use this." Um, design from everywhere. I actually had someone say like, "Oh, I wish I could design uh with Magic Path from my phone." I would never do that because I think that's too much cognitive overload.

But if you want to hook it up to Telegram, WhatsApp, whatever, you can like you let your design bake and then go check on it later. So, making that uh aware. So, the cool thing again is closing the loop between design and code. Um, I don't have it on here. Oh, no, I do. I do. Uh, but basically you can have a magic path design, put it in your codebase, finesse it.

Even if you do edits to the local file, you can put it back in magic path so you always have a clean file. And again, these all have live links, so you can send it over Slack, send it over iMessage, I don't know, whatever. Um, and yeah, so uh this is kind of just like a wish it was bigger, but this is just me trolling around my file like uh you know, again, because it's a paintbrush, I want you to make art. I want you to make projects, things that might never be shipped, but at least you told yourself, you like spread everything out and like really thought about it, right?

Because I think in the future we need to do things that make our brains wrinkle a little more. I think mine's like, you know, like inflating. So, um, you know, this is just me playing around with art projects. Like I plugged in I bought a Japanese texture pack off Twitter and like I like hooked it up to my you know local uh agent or my external agent and then it put really awesome things in magic path and I can see that being used for like landing page or some other creative endeavor.

Okay so the last thing I wanted to say is oh shoot over um this is my incredible team nothing great is built alone part two. We are primarily based in New York City. If you're ever there come say hi. We're in downtown Manhattan. It is such a blast. And okay, so take a picture of this because guess what? All the slides are on there as well as recommended readings.

The myth of the paperless office. There are some blogs that like, you know, Maxim's blog is there. There are some really good resources there for you guys. I also have every single slide. It's not totally accurate, but it's up there. Um, my email and Twitter, please tweet about this. If you um actually make something and you DM'd me, DM it to me or if you DM me in general or send me an email like I would love to like personally onboard you and help your team get set up and yeah, we can host your design system.

I actually think that's the biggest question I've gotten. They're like, "Oh, can I transfer my design system to here?" Yes. Uh I think that's it. &gt;&gt; Thank you, Sabina. &gt;&gt; Huge round of applause for Sabina. Everybody keep it going. Yes. Get the mic, young man. We uh look how many of you design images with like chat GPT or Claude or some Yeah, many.

Okay, this is like 10% of the room. Um I think many of you don't do it because one, it's kind of &gt;&gt; you know, uh like we kind of know what slop looks like. Um or it's it it makes mistakes. Six fingers, right? Anyone see Katy Perry at the Met Gala? You know what I mean? It's a cool art. Anyway, um image generation either for brand assets like logos, um business cards, things like that has always been somewhat of a challenge because we know what slot looks like, but also where's where'd you go?

Oh, there you are. Did you get a mic? Go grab it, bro. No, they they don't need it yet. Go get it. It's fine. Anyway, this is BTS. Anyway, um so here's the deal. When you get when you get um and I'm I'm invested in this now. Hang on. Oh, let me just can we use this to introduce her and then we'll give you the mic.

Thanks. Um, anyway, so when you get an image from an AI model, you get one image. It's like a flat image, you know, but if you're a graphic designer, you work with layers. You know this, right? Like like you have like a background and a foreground and all kinds of layers. Well, how cool would it be if AI could do that for you?

Give you like a Figma ready thing with all the layers that you can use. And that is what I'm getting ready to hear about. I'm very excited. Who's the next speaker? &gt;&gt; Priya. Introduce her, bro. &gt;&gt; Yeah, I know. &gt;&gt; It's okay. He's new, but we're training. We're training. &gt;&gt; Okay. So, now our next speaker is going to be Priya, who came also who also came from San Francisco to Singapore, which is a 17.5 hour flight.

And she's going to be talking about how AI can become your design partner and help you create some really cool stuff like uh similar to Canva but better. &gt;&gt; I don't know. Anyway, just that's free. Give her the mic. Fantastic. Give it up for Priya everybody.

Death by Prompting: Layered Editability as the Real Architecture for Creative AI

Priyaa Kalyanaraman, Co-founder & CEO, Lica World (infrastructure to train and evaluate creative AI models)

Summary

  • Priyaa Kalyanaraman (CEO, Lica World) opened with 'death by prompting' — live-demoed Gemini/nano-banana producing progressively worse Devil Wears Prada poster edits when asked to swap a silhouette for a blonde woman. The thesis: one-shot pixel-output models are an engineer's idea of how creatives work; creatives iterate, walk back, scrap, restart.
  • Names the design-intelligence triad explicitly: taste (thousands of micro-decisions on corner radius, crop type, margin, kerning that AI flattens), iteration (designers move backward and sideways, not forward), and layered editability (every prompt resets state because MP4/PNG are frozen file formats with no exposed layers).
  • Why labs haven't solved this: no data. Code has GitHub; graphic design has three walled gardens — Figma, Canva, Adobe — sitting on billions of editing traces nobody else can touch. Lica's wedge is collecting the missing corpus.
  • Shipped: 1.5M layered graphic-design compositions across 50+ categories (Instagram, presentations, posters, business cards, planners, brochures, infographics, etc.), with semantic + logic group annotations, crop metadata, layer positions — open-sourced as a dataset explorer plus a Graphic Design Bench. Usable as Claude-agent skills, training data, or eval harness.
  • Architecture argument: instead of one giant model owning the canvas, expose layers so specialized models can be delegated per layer (SVG model for vector, text model for copy, photo model for image-fill, brand-LoRA for guidelines). Enterprise constraint named: marketing teams generating banners across SE-Asia languages can't burn tokens regenerating the whole creative when they only need to change one text layer.
  • Reward design: multi-signal reward learning combining human-preference rewards (gameable, taste-relative) with objective rewards measuring design-principle validity. A self-updating AI judge refines its rubric over time and feeds back into generator retraining — design has a 'shelf life,' so the loop must continuously expose fresh good examples.
  • Closes with a pointed concession: 'My slides were all inconsistent and all over the place. That's how AI models are today.' Layout planning, visual consistency at scale, and layer-level editability without human-in-the-loop remain extremely hard — Twitter/LinkedIn hype notwithstanding.
  • Distribution: huggingface + GitHub links + Graphic Design Bench at lica.world; handle @pritopian. Positioned as data infrastructure layer, not a competing design tool.

Slides (22)

Slide notes

  • [t=7:34:01] Slide 1 (title): 'Priyaa Kalyanaraman, CEO, Lica World — How to train and evaluate design intelligence across expressions of taste.' Hero illustration: 'death by prompting' — frustrated designer at desktop, three cartoon mascots (cowboy, robot, banana) on the desk.
  • [t=7:34:35] Slide 2: Live Gemini session — Devil Wears Prada poster with prompt 'change the image mask to fill it with a woman with blonde hair and green eyes.' First output: replaces silhouette with a red blob. Second: progressively weirder.
  • [t=7:35:00] Slide 3: Grid of four progressively worse Gemini outputs — illustrates state-reset on every prompt and loss-in-translation from text-only input.
  • [t=7:35:30] Slides 4-5: Lica's own output — same prompt, same source poster. Isolates the silhouette into a discrete layer and fills only that layer with the correct image, preserving text/composition. Demonstrates layer-level localized edit.
  • [t=7:37:00] Slide 6: 'Walled gardens' illustration — three castle/garden compounds labeled Figma, Canva, Adobe. Verbal claim: 'They hold billions of editing traces and data that none of the labs have access to.'
  • [t=7:38:00] Slide 7: 'Introducing LICA: A Dataset of 1.5M Layered Graphic Design Compositions.' Table 1 — component-type distribution: Text 8,146,222; Image 7,019,837; Group 5,313,914; Vector 4,198,047; Total 24,678,054 individually annotated elements; avg 15.93 components per layout. Table 2 — 20 design categories: Instagram Post, Presentation, Planner & Calendar, Education, Flyer, Social Media, Business Card, Card & Invitation, Print Product, Planner & Calendar, Business Document, Poster, Video, Logo, Banner, Menu, Infographic, Brochure, Postcard, Art & Design.
  • [t=7:39:30] Slide 8: Lica canvas view — Devil Wears Prada poster with the Layers panel exposed on the left: 'THE DEVIL, Wears prada,' 'MERYL STREEP,' 'ANNE HATHAWAY,' 'FOX 2000 PICTURES PRES...,' Image 2, Image 1, Background.
  • [t=7:40:00] Slides 9-10: Dataset Explorer UI on lica.world — category tabs (Instagram, Presentations, Planner & Calendar, Education, Business Cards, Memes, Social Media, Business Documents, Movies, Print Products, Cards & Invitations, Posters, Banners, Brochures, Logos, Infographics, Animations, Memes, Album Art Designs). Asset shown: kids' astronaut poster. JSON annotation panel on the right with image/group elements, bounding boxes, IDs.
  • [t=7:41:00] Slides 11-12: Big-text reveals — 'breaks every' and 'LAYERS into' — argument that one-shot output breaks every iteration; layers are the answer.
  • [t=7:42:00] Slide 13: Mock UI — Layers panel (Text, Image layer 1, Solid, Image layer 2) next to a historical battle scene with 'MAKE CLAUDE FREE' text. Prompt bar: 'GPT 4.5 — Make the background a bit darker.' Demonstrates per-layer model routing.
  • [t=7:42:30] Slide 14: Same UI zoomed — 'GPT 4.5' model selector dropdown with avatar 'Larry' next to it. Multi-user / multi-model layered editing.
  • [t=7:42:50] Slide 15: Same UI — model swapped to 'Nano Banana,' layer 4 PNG selected. Shows model swap per layer, same canvas.
  • [t=7:44:30] Slide 16: Closing QR + 'www.lica.world / @pritopian' — links to HuggingFace dataset, GitHub repo, Graphic Design Bench. [t=7:44:55–end] Slides 17-22: speaker-cam, MC handoff to Usman, intro for next speaker (Hyperspell — 'How to build a company brain'). Drop from notes.

Full transcript (1833 words)

&gt;&gt; Good evening. Uh thank you so much for that intro. I feel like you explained uh a lot of the things that I was going to talk about. Um my talk is I'm the co-founder and CEO of Leica and we're building the infrastructure to train and evaluate creative AI models. And what that really means is I spend all day yelling at image generation and video generation models because they don't understand our prompts.

And we are working on building the infrastructure to get them to be better at it. And uh we want to avoid the problem of death by prompting. Um I think he asked this question. How many here have used chat GPT or nano banana to generate slides, presentations, social media posters? And I didn't see any hands go up.

Are you all lying? Okay, now I see more hands go up. So obviously most of you use chat GPT or nano banana to generate images. And I'm sure um I'll share like what I was doing today and most of you might empathize with what I was going through. Um this is like devil wears Prada a poster and I asked uh the I asked Gemini to replace the image mask uh with a woman with blonde hair and then this is what it gave me.

That's okay. Uh I still had some patience left in me and then I prompted again and this is what it gave me. And then it continued to get weirder. This is what I ended up with and now I completely lost my So I I thought okay this is not going to work. So this is like progressively worse results that I saw.

So what we do at Leica is a little bit different. So if this is the image and this is the same prompt I gave change the image mass to fill it with a woman with blonde hair and green eyes. Uh it isolates everything into layers and then it fills that layer with that exact image. The level of localized edits you can make is crazy.

If you have layers exposed and you're able to delegate each layer, you can also move the text around. You can change anything that you want here. And you might be asking why is this small? Well, I guess like the reason why we're able to do it is some of the uh companies that are doing image generation or video generation, they output MP4s or PGs and they are frozen file formats and the layers are not exposed and every with every prompt the design state is reset and text is not a very interesting input uh medium because many people don't know how to verbalize what they want.

So there's a lot of loss in translation and there's no human AI multiplayer experience today because of that and the way we have solved it is really to do this layer level editability and layer level editability is not just for humans to move things around but there might be other specialized models that you could use for different layers. It could be for text generation, SVG generation, photo generation. You don't always need to use one giant model for everything. And you might be asking why should a startup tackle this?

Why haven't the big labs already solved this problem? And the honest answer there is there is no data. With code, there's a ton there's tons of like GitHub repositories. LLMs have gotten really good at text processing. Whereas with graphic design, you just have these three giant companies. They're all walled gardens.

Figma, Canva and Adobe hold billions of editing traces and data that none of the labs have access to or no one in the community have access to. So when we as a startup decided to tackle this problem head-on, we thought from first principles and also decided to uh tackle the problem of like what is the missing gap in the market and that is data. So we went ahead and collected over 1.5 million layered graphic design composition. So what that looks like is some of this has been open source.

So you can actually go and check it out. This is like a fun explorer that we built where we have put out data from across so many different design categories, 50 plus categories from Instagram to a business presentations to posters and each data point has several rich annotations on what the image looks like, what are the crops like, what is the positions of it and if there are semantic and logic groups then you can actually see which elements need to be grouped together. So you can teach an AI model how to do refflow of content or if an aspect ratio needs to be changed, it really knows how to plan the layout. All of these things all the frontier models today suck at.

And you can play around with this data and uh parts of this have been open source. So you can also give a lot of these configuration files as uh uh skills for a cloud agent and it performs a lot better and you can also train models or build eval on top of it. So the way we approach this problem is you can get oneshot outputs today from ton of degenerative AI models and some of the results are really really impressive. But when you hear comments like AI lacks taste, what that really means is designers obsess over details.

Somebody is thinking about what the corner radius of a rectangle needs to be. What should the crop type be? What should the distance from the margin be for a text box? And thousands of these small tiny decisions is what elevates a design. And AI models don't quite understand how to think about some of these tiny decisions.

And every small misstep here makes that output very hollow and sloppy. So the way we've approached it is really to isolate everything into layers and each layer can be shaped very differently with proprietary data from an enterprise or other data collected uh uh from elsewhere and the layer level data is going to be very helpful because in an enterprise people don't have unlimited tokens to spend especially in marketing functions let's say in e-commerce you have to generate banners that align with a certain brand guidelines across so many different countries. Let's say in Southeast Asia where there's like ton of languages and you just want to change the text or specific graphics but retain all the other elements as is. You just want to be able to manipulate those layers.

Or there are instances where you want to combine camera generated imagery with some parts of human written copy and fill the other pixels with something that's AI generated. And you should also be able to combine a constellation of models because as more and more models come out, you probably want to delegate different aspects of design to different models. And this architecture allows for it. Because today, if you want to oneshot everything, that is an engineer's idea of how a model should function for creatives.

Whereas creatives, creativity is just inherently incremental and iterative. You walk a few steps backward, then sideways, and then you probably decide you want to scrape the design and start over. And the current models do not allow for that. We also came up with a multi- signal reward-based learning system where design is easy to game if you just use human preferences.

And especially when you work with brands which have different expressions of taste, you want to be able to come up with part of rewards that are human-based preferences and augment that with certain objective uh rewards that measure whether the output is valid and meet certain design principles. And then we have two models. One is an AI judge that is able to increasingly update itself on its rubrics so it can get better at discriminating good from bad and then use that updated AI judge to retrain your generator that can continue to get better because design has a shelf life and you constantly want to expose really good examples and train your model to be up to speed and also build an architecture where you move beyond textbased prompting so that you can capture different types of interactions that can be part of the training loop. This is not the reality today.

I never smile when I'm working on evaluating any of the image generation models. But if there's anything you want to take away from this talk, it's that my slides were all inconsistent and all over the place. And that's how AI models are today. No matter what the Twitter hype or LinkedIn hype is, models are very poor at layout planning.

Getting visual consistency at scale without human intervention and editability, especially layer level editability is extremely extremely difficult. So if you're interested, you could scan the QR code. Uh we have the hugging face link, GitHub link if you want to use the data set that we have open sourced and we've also put out a graphic design bench. You can use that to train your cloud agent or uh you could also try to use that as eval uh if you have internal models that you're training or reach out if you're interested in this space.

Thank you. &gt;&gt; Y'all are such a wonderful audience. Always clapping for your speakers when they do a great talk, which is all the time. Great. One more round of applause for Priya, everybody. So good. So good. Our next speaker uh is is so cool. He he has an amazing story that you're going to hear in just a minute as we introduce him.

Uh I'm not even going to introduce him. I think he's a pro now. Give it up for your other MC. Everybody, Usman. &gt;&gt; Thank you. All right. So, now we're going to be introducing our uh next speaker who's come yet again all the way from San Francisco to Singapore. And uh that's a 17.5 hour flight by the way.

Anyway, he's come a long way throughout his journey where he came from zero to hero. He used to live in a hacker house. Uh specifically the closet and and uh he was a college uh no not college uh high school dropout at 12 and now his company uh what was your company? &gt;&gt; Hyperspell. Uh now his company Hyperspell has come such a long way that it has raised over $6.7 million not 67. &gt;&gt; All right.

Hey, give it up for the announcer everyone. Let's go. You did an amazing job. All right. How are we all doing? Final day of AI engineer. Let's finish strong and make it happen. Hey everyone, my

How to build a company brain: context infrastructure as the real AGI bottleneck

Conor Brennan-Burke, Founder & CEO of Hyperspell (context-infrastructure / company-brain layer for AI agents, contracts-infrastructure pitch line, SF-based)

Summary

  • Core thesis: agents are 'clueless geniuses' — savant-PhD-intern level reasoning, but every day is day one because they have no company context. AGI is not gated by better models; it is gated by getting the right context to already-brilliant models.
  • Connectors are not enough. Plugging Claude/ChatGPT/OpenClaude into Slack/Drive/Notion gives access, not understanding. Agents naively trust whatever doc they hit first, miss corrections, follow stale versions, conflate five Lisas into one entity, and have no recency model. Documents are a lagging indicator — the 'source of truth' is rarely true.
  • Solution: one company brain per organization — a single source of truth with confidence scores, authorship, recency, conflict resolution, and entity-resolved threads across Slack/Gmail/Notion/meetings. Not RAG-across-sources; an arbitration layer that decides what is actually true.
  • Critical source-mix expansion: context is no longer just human-generated. Agent prompts, OpenClaude/Claude memory, agent traces, Claude Code reasoning loops, meeting recordings, screen recorders, even Meta-style keystroke logs — all of this is execution-trace context that must enter the brain.
  • Four knowledge types the brain must store separately: (1) stable facts (legal entities, org chart, brand), (2) process knowledge (onboarding, deal review, incident response), (3) tacit wisdom (only-in-heads — flaky tests, sales tricks, dead integrations), (4) stateful reality (open deals, active incidents, today's blockers). Each has different decay rates and storage discipline.
  • Build pipeline: ingest (historical + real-time, Slack-style streaming) → normalize (entity dedup, one Lisa) → synthesize (escalate genuine conflicts to humans for arbitration) → serve. The hard part is synthesis, not search.
  • Representation choice: context graph for the substrate, but expose to agents as a file system — because agents are post-trained on file systems, not graph DBs, and a filesystem is universal across Claude Code, Cursor, OpenClaude, NanoClaw, and internal/personal agents. Hierarchy: company → team → individual, with /decisions, /events, /people subdirs.
  • Endgame — recursive organizational learning: every human and agent action emits traces, traces feed the brain, future executions get better. One Claude Code instance's learnings propagate to the whole org. Solves 'people context drain' (employee walks out, takes context). 'The company gets smarter without adding another meeting.'
  • Pitch line / positioning: Hyperspell = 'contracts infrastructure for AI agents.' Outbound channel: conor@hyperspell.com, @contextconor on Twitter, hyperspell.com.

Slides (8)

Slide notes

  • Demo-heavy / short deck — only 3 substantive content slides, 5 frames are speaker-cam or stage-handoff to next speaker (Hanjun Lee). Recording the substantive frames only. [t=7:44:55 / slide_01] Title slide: 'Your agents are clueless geniuses.' Subhead: 'They can reason brilliantly inside the context you give them. They just don't know what's stale, who to trust, or what changed last week.' Red pull-quote: 'Intelligence isn't the problem. Context is.' Speaker lower-third: 'Conor Brennan-Burke, Founder & CEO, Hyperspell — How to build a Company Brain.' [mid-talk / slide_02] 'The company starts learning from itself.' Subhead: 'Every action creates context. Every context update improves the next action.' Five-stage circular loop diagram: Humans and agents execute → Work creates traces → The brain captures and synthesizes → Context updates → Future execution gets better → (loop). Red pull-quote: 'The company gets smarter without adding another meeting.' [closing / slide_03] 'Every company needs a brain. Hyperspell builds it.' Subhead: 'Context infrastructure for AI agents.' Contact strip: hyperspell.com | conor@hyperspell.com | @contextconor. Visual: stylized brain on dark background. [slide_04 through slide_08]: speaker-cam / stage-handoff frames showing Conor walking off and emcee introducing the next speaker (Hanjun Lee). No slide content — dropping from notes.

Full transcript (2125 words)

Name is Connor Brennan Burke. I flew all the way here from San Francisco. 17-hour flight. I am incredibly jet-lagged right now, but we're going to push through it. Woot. Let's go. All right. So, we at Hyperspell build company brains. And what I'm going to tell you today is how to build a company brain. That's right.

How to make it so that agents actually understand how your company works. And this doesn't work. All right. There we go. All right. So, I think this is a theme we've heard quite a bit today from different speakers. Um, your agents, to put it bluntly, are clueless geniuses, right? They are they're like this um, you know, savant, PhD, slightly autistic intern that is absolutely brilliant, but doesn't know anything about your company.

Every single day for them is like the first day at work. They blindly follow uh, whatever they read. They're kind of naive. They'll take instructions and just go with it. And so you need humans to watch over them. The problem and the thing that gets us to AGI is not better models. The models are already brilliant.

It's getting the right context. Your agents are clueless geniuses and the lack of context is the reason that they don't yet get work done reliably. All right. And so the question is how do you solve this? So the obvious answer is connectors, right? We've all done this. We've said all right I'm going to give my openclaw access to my slack my drive and my notion I'm going to use connectors in anthropic and claude and chatgbt but the problem with this is again as we've said agents are kind of naive anything they read they assume is true but it turns out that documents themselves are not actually often true um so they'll find a doc they'll miss the correction they'll find an old version that's out of date uh if there's two different sources they'll conflict with each And whichever one they find first, they'll interpret it as true.

There might be the same person mentioned in Slack and Gmail and Notion. They don't realize it's the same person. They're like, they think it's five different Lisas instead of one Lisa. And there's also no recency, right? You find old, deprecated, outof-date documents and they try to operate off of that. Um, so connections give access.

They don't give understanding. So everybody here, I know not everyone here is working yet, but the folks who have, how often have you gone in and started a new job and read a document and been like, "Oh, okay. This is our strategy or this is the process and then you go do it and you talk to somebody and it's like, oh no, that's out of date. That's no longer relevant.

You got to talk to Bob instead and Bob knows all of it and like talk to this person." How many people has that happened to almost every single person here, right? And so the thing about this is that by giving connectors to agents, we are assuming that truth is in documents. But that's not how things actually work. So the source of truth, as we call it, is rarely true.

It turns out that the moment information is created, it starts to become out of date. Documents themselves are a lagging indicator. You might have a reorg change or a customer exception or a new deploy. And so reality gets far away from the dock and it requires human beings to update docs to make them true.

And so how companies actually operate is you have the extremely messy reality where there's Slack threads and meetings and emails and exceptions and all these things happening and then you have this document. So people try to record stuff but we're all not great at updating documents and recording them. And then you have what's actually true. And so often, as we just said, the way to get to what is actually true is by asking someone, right?

You ask your boss, you ask the person who's been there for like five years who has all the context. And so, human beings are good at understanding this. You know, not to just blindly trust any document you get in any process. You ask the person, but agents don't know to do that. Anything they read, they assume is true.

And this is why you can't just let them run across your organization. If we want to deploy agents at scale, we need to give them a source of truth. So how do you solve that? You create a company brain. So every single organization needs one single source of truth for agents. One company brain. Now what is that?

It's not just connectors. It's not just rag across sources. It is one source of truth that has confidence in it. That understands who created this document. That brings in threads together from the email and the slack and the noten and the messy meeting that surfaces conflicts and identifies okay there's two different sources that say different things how do we resolve between them it figures out what reason and it creates one source of truth that agents can actually trust what does that give you it gives you better answers it gives you aligned agents it gives you durable knowledge it makes your organization ready to deploy AI the reason that so many enterprise AI deployments fail is they try to deploy agents that agents read the documents and there is no company brain for them to operate off of.

This is the thing that we need to make AI actually work. Now the other nuance here is we assume that context is human generated but that is not true anymore. So traditionally, you know, you have people in meetings and slack and docs and emails. But now you have miscontext, right? You have the meetings you have with humans.

You also have all of your prompts to your agents. You also have your open clause memory. You also have the traces and reasoning. And that is context too. The loop that you go with with your clawed code to get to the final output. All that context is useful. And if you don't put that into the brain, you're missing a huge source of context to get to the final output.

And so we are all moving from organizations that are mostly human to mixed to in a few years, the majority of context is actually going to be created by agents and all of it needs to be in the brain. Now there are different types of knowledge inside of companies that you need to put in this brain. There are stable facts, so things like legal identities, your org chart, your brand colors. Then there's process knowledge.

How do we do onboarding? How do we do deal review? How do we respond to incidents? There's also tacet wisdom, things that are only in people's heads. And so remembering, okay, how to close this customer or what's a better sales strategy or uh this particular test is going to be flaky or this integration doesn't work well.

There's all have this tacid knowledge. It's very rarely written down and it's very rarely in a single source of truth that you can make work. And then finally, you have stateful reality. So you have open deals, active incidents, today's blockers, and the company brain needs to have every single one of these and store them all differently.

In order to get to a source of truth, you need to know how quickly things are evolving and what that central place is. And so the way to actually build this is we found from our experience working with everything from tiny startups to mass and Fortune 500 customers is that you want to start by ingesting all the data and so pull in all the sources of truth again your Slack your Gmail your notion your GitHub more and more now we have meeting recorders as well you also have your agent traces right agent generated context now for example meta is starting to do this and meta is even logging keystrokes some people have recorders of their screens. You need to embed all of that. The next thing you need to do is create a context graph.

What a context graph is is it's a single graph entity that finds every single fact in the organization, understands when it was true, who authored it, how confident are we in it, and embeds that all in one place. But the thing is context graphs and graph databases in general are not great UX for agents. Agents are not post-trained on them. They don't natively have an understanding of how graph databases work.

And so actually the best way to represent this for agents is a file system. And so you create a file system with for example data at the company level. Who are the people in your company? Who are the prospects? Who are the customers? You have your decisions? You have events. Below that you have files for each team.

And then you have each individual. And the great thing is because file systems are universal. You can use them with cloud code. You can use them with cursor. You can use it with open claw, nano claw, internal agents or even your own personal agents. Now let's talk through how the brain gets built. Um the first step is context capture.

So you have all these messy sources. You need to ingest all the historical data but you also need to get it real time. Something like Slack has real time context and if you miss that as it's happening then your agents won't have up-to-date information. The second thing is normalizing it. So we talked about uh that understanding the LISA in your emails the same as the Lisa in Slack getting to one single entity dduplicating it structuring it.

The next thing is synthesizing. So sometimes data conflicts and when data conflicts you need to actually bring that to humans to say okay we have this trade-off which one do we choose and then finally you serve it to the agents and have a single source of truth. Get the agents the right context at the right time so they can get work done. The hard part is actually not search, it's synthesis.

It's bringing all this information together. Is anyone here familiar with Carpathy's second brain idea? Okay. Does anybody have a second brain already? And has anyone here used obsidian as a source of personal truth or uh second brain? So this is exactly that but for your entire company, your whole team, every single person in your organization and every agent in your organization.

Now what this enables is you get to a point where the company starts learning from itself. Every single action that you take creates context. The humans execute, the agents execute, work gets done, new context gets created. All of those traces then are synthesized and put into the brain and then future execution gets better.

Imagine where every single claude code instance can now take those learnings and those new uh takeaways and share them with the entire organization. Every salesperson if they learn a better way to sell that's instantly shared with everyone. What this enables is companies that improve recursively over time. Traditionally we've had people context drain where people walk out the door and then take their contacts with them.

Now you can get organizations that self-improve, that get better, and every single person, every single agent constantly makes it better without adding another meeting. So we are Hyperspell. We believe every company needs a brain. We build it for you. We are contracts infrastructure for AI agents. If this is a problem that you want to solve, you should find me afterwards.

Send me an email or find me on Twitter. Thanks everybody. &gt;&gt; Wow. Just wow. Such a genius, bro. &gt;&gt; Give it up for this guy. &gt;&gt; I can't believe just an ordinary guy like him could make a a masterpiece like that. Uh I believe all of uh everyone's brands or companies deserve a big brain like his.

Please welcome our next uh speaker who is hangong hang hong Lee and he has come to show all of us that uh we can all shift fast with code and how like you can uh do something as good as him. Thank you. And please give it up for Hangong.

Plan, Preview, Orchestrate: Three Primitives for Shipping with Cloud Coding Agents

Heng Hong Lee, Co-founder of Lightsprint (YC-backed, three Singapore founders building a cloud-coding-agent workbench / harness-of-harnesses)

Summary

  • Lightsprint is a current YC company started by three Singapore founders building a workbench for teams to run cloud coding agents reliably, quickly, and safely against existing codebases — pitched as 'command a fleet of AI engineers in the cloud.'
  • Defines cloud agents as managed, non-interactive (background) agents that work while you don't — distinguished from in-IDE coding agents like Cursor/Claude Code that only work when you do. The promise: 3-5x throughput, organizational backlog drain, and agents that learn the org over time.
  • Three failure modes Lightsprint is built around: (1) agents don't get the right context/plan, (2) you don't know where the agent is or if it's stuck, (3) reviewing PRs sucks when you can't spin up a working preview — 'nobody hates a PR to review more than the one that doesn't work.'
  • Three primitives in response: PLAN (visual plan wizard that generates multiple-choice options with recommended + 'other,' grounded in codebase context, with interactive in-app visual previews of how the feature will look before spec generation), ORCHESTRATE (Kanban-style task board where you fire off agents and track where each one is), PREVIEW (live working preview environment for every PR before review).
  • Architected explicitly as a harness-of-harnesses: Lightsprint wraps Cursor, Claude Code/Anthropic, Codex as underlying coding-agent harnesses, plus its own 'Lights cloud agent' which is itself 'a harness around that harness.' Same pattern for plan mode — supports Gstack and Lightsprint's own plan mode.
  • Live demo flow: prompt becomes an enriched task → plan wizard generates options + visual mock → user picks/refines/adds options → spec ships to a chosen coding agent via bot → agent returns ~20 minutes later → previewable working app before merge.
  • Lightsprint dogfoods Lightsprint — 'cloud agents first' is the operating posture; only small mobile/local bugs run locally, everything else gets fired off to the board.
  • Framing question the team is chasing: 'what is the nature of work in the AI age' — three Singapore founders with product/engineering backgrounds positioning the workbench as the answer.

Slides (19)

Slide notes

  • [t=7:56:35] Slide 1 - Title card: 'Heng Hong Lee, Co-founder, Lightsprint. Plan, Preview, Orchestrate: Three Primitives for Shipping with Cloud Coding Agents.' Background: Lightsprint platform screenshot showing task 'design a repo home page v2 that uses ask as the chatbox and has a personalized b...' with an Approach panel ('Server-ranked bento with streamed AI sort') and an agent chat panel on the right.
  • [t=7:57:10] Slide 2 - Same platform shot; demo of plan mode generating server-ranked bento approach with a wireframe preview region appearing under the approach card.
  • [t=7:58:00] Slide 3 - 'Repo home v2 architecture' empty agent panel — showing the orchestration / task-launch surface before agent starts.
  • [t=7:58:40] Slide 4 - 'Ask conversation expansion' multiple-choice plan wizard: two options side-by-side — 'Push-down inline expansion' (recommended, green border) vs 'Right side slide-over panel.' Each option shows a visual mock of the resulting UI plus a written rationale. THIS IS THE PLAN PRIMITIVE in one slide.
  • [t=7:59:20] Slide 5 - 'Bento grid responsive behavior' plan step — two layout options shown as visual mocks: '2-column with hero span + single column on mobile' (recommended) and '3-column masonry → 2-col → 1-col.'
  • [t=8:00:05] Slide 6 - Same bento step zoomed — recommended option highlighted with green check; demonstrates the 'restrict the AI to show something visual' principle Heng Hong narrates.
  • [t=8:00:40] Slide 7 - 'Card click behavior' plan step: 'Navigate to detail page' (recommended, green) — shows how the plan wizard cascades through each design decision.
  • [t=8:01:20] Slide 8 - 'Empty state when user has no work items' plan step: 'Single CTA with illustration' vs alternative; reinforces the multiple-choice plan pattern.
  • [t=8:02:00] Slide 9 - Lightsprint dashboard view: left nav (Home, Notifications, Tasks, Plans, Insights, Health), main pane shows 'Your work' with task cards (Fix OAuth token refresh race condition, Add rate limiting to /api/hooks, Notification system redesign, Migrate user settings to new schema, 8 functions exceed complexity threshold, Update error boundary components). THIS IS THE ORCHESTRATE PRIMITIVE — the workbench/task board.
  • [t=8:02:40] Slide 10 - SpongeBob 'A FEW MOMENTS LATER' meme card — used as the cut between firing the agent off and the agent returning ~20 min later.
  • [t=8:03:15] Slide 11 - Lightsprint.com landing page: headline 'Command a fleet of AI engineers in the cloud.' Subhead: 'Lightsprint is a product workbench where engineers and PMs ship together on your real codebase, not a prototype. Describe what you want, see visual options styled to your app, preview it live, and ship a real pull request.' CTAs: 'Start free' / 'Book a demo.' Right column: agent activity feed.
  • [t=8:03:50] Slide 12 - Lightsprint sign-in page: 'Welcome — Sign in or create an account for free. Continue with GitHub / Continue with Google / email.' Tagline at top: 'Collaborative product development for AI-native teams.'
  • [t=8:04:20] Slide 13 - Speaker on stage gesturing — SG skyline backdrop (Marina Bay, Singapore Flyer outlines).
  • [t=8:04:50] Slide 14 - Wide stage shot, 'AI Engineer Singapore' banner + 'AIE' standee; Heng Hong center-stage. Speaker-cam frame.
  • [t=8:05:30] Slide 15 - Speaker on stage, SG skyline backdrop, mid-summary of three primitives.
  • [t=8:06:10] Slide 16 - Stage transition — emcee at podium handing over.
  • [t=8:06:40] Slide 17 - Emcee on stage with SG skyline backdrop, intro to next speaker — riff about 'everybody's got this orange thing going on' (Claude color scheme observation across talks).
  • [t=8:07:30] Slide 18 - Emcee continuing: framing the hot take — 'in the future we will likely just plan and orchestrate agents that write and ship code' — reinforcing the plan/orchestrate framing as the dominant pattern across the day's talks.
  • [t=8:08:00] Slide 19 - Next-speaker card: 'Louis Knight-Webb — @tokengobbler / vibekanban.com / london.aitinkerers.org' — handoff to Vibe Kanban talk.

Full transcript (2064 words)

&gt;&gt; All right. Thank you, Usman. That was great. Right. Thank you everybody for coming. Right. Today I'm going to talk about the three primitives that we need to ship fast with cloud agents. Right? Everybody wants to ship fast and I was telling man behind that we should like multiply ourselves, right? How can we multiply everybody and right now?

So like we are light sprint, we are current YC company, we are three Singapore founders and we are three curious Singapore founders. We are looking to um figure out what the nature of work is going to be in the AI age, right? The nature of work is changing really fast with like we three of us we have a bunch of experience doing product doing engineering and we're trying to figure out what that means, right? So right now we're building cloud agent environments, right?

We're helping teams build their environments up so that their whole team can ship um make changes to codebase and existing code bases um reliably, quickly and safely. Right? What is cloud agent? Right? This this slide you probably you know know everybody's talking about cloud agents today um and yesterday and the day before.

So quick one most cloud agents are mostly from a managed environment. See, they are like a coming out of the the cloud basically and a service that's set up usually set up by the company, right? They're also non-interactive. So, you fire them off and then they go around and they build something and they come back with what they built works in the background.

Sometimes they're called background agents. Some people confuse cloud agents and background agents. They're the same thing. They just work in the background. A quick a quick kind of like primer bring everybody where kind of how we got here, right? Starting we started with the agent inside the computer helping us type cursor.

I was an early cursor user that was super fun like command K and everything. And then we had coding agents cloud code cursor again, right? Everybody is inside our computer. It works when we work. It stops unfortunately when we stop. But now today like cloud agents basically they they are everywhere. They work for us all the time.

Um, if you know how to control them. And so today we're talking about that. The promise is great. Like cloud agents hope to reshape your organization. They want to kind of um build out like your backlog, finish your backlog basically like you know they can build anything. Um, anybody can kind of like put together stuff.

And the last thing is like similar to what Hyperspell is doing like you know they the promise is that they're going to learn your organization and help you make operations better. The best of these companies are already using cloud agents. So uh they get a three to 5x like improvement and sometimes even more. Uh some of the startups that we're talking to are using them very effectively and a lot of people are seeing that they like the number of PRs that are merged just created coding agent created PRs are rising incredibly fast.

Okay, so now how to make cloud agents work for you instead of against you, right? So um a lot of times like you might not give the cloud agents the right context and so um we want to make sure that like you want to make sure that you give the right agent the right context and you want to do um make sure to give them the right plan and context. The other thing that you want to do is to make sure that you know where the agent is at any point in time. Right?

So you want to make sure that you have the cloud the agents doing the work that you're you're asking them to do and you are able to check where the agents get stuck or are they currently stuck or are they still working. The last piece is as an engineer I feel like uh it's super important like is if if my whole team is going to ship code to me I need to review them and if I need to review them then I need to spin up uh coding environment for that and then I need to make sure that that works and the worst is it doesn't work and then I have to go back and tell them that it doesn't work and they need to um rebuild the PR which I could just do by myself. Right? So at Lightream, we're thinking of it in three primitives.

You need to plan properly so to make sure the agent has the rest has the best uh stuff. You need to orchestrate, you need to make sure that you know where the agents are, and you need to preview. So I'm going to jump real quick into like, you know, our app. And so I I feel like I've talked a lot, but I haven't shown anything.

Um so right now I'm trying to I'm going to introduce you our application. Oh, this is the middle. Ah, yeah. So this is the Lightprint platform. And so the live stream platform is basically a workbench for your team to collaborate, right? So you can see that like it's just a bunch of um camb boards and a lot of tasks and stuff like that.

And then yes, plans at the side you can see. So basically what's happening here is that we are basically helping uh to create tasks. So we we put the prompt in the task format and so it actually is grounded by the codebased context and basically able to quickly uh enrich your task with a lot of information so that the coding agent can like get launched. So we support a whole bunch of coding agents.

We have cursor, we have entropic, codex and and these are just harnesses under our system. And basically we have our lights cloud agent which is also a harness around that harness. Right? Once you launch the cloud agent you can basically click through to dive into the codebase into the the code inside. So what you're seeing now is plan mode, right?

We we want to change this screen. It's kind of boring. It's not AI. Is just basically a list of your recent tasks and your recent plans, right? So, let's uh let's kind of use our plan mode right now. So, we figure fire our plan mode. We support right now Gstack and our own Lightrint plan mode, right? So, we're using our current light plan mode.

And basically what it does is that like the idea here is that we want to create a multiplechoice with recommended and with other and that's really like our favorite like use case, right? Like people everybody loves choices and everybody loves like you know um the AI coming up with choices for them right but we also love visual right we love like to see the mock so we also made the AI kind of restricted the AI to kind of say please make a good experience for the user by showing them something visual right so we allow the user to pick like different options even make more options right say okay you know please give me two other new options and then those two options will just be added along and then you can pick from those as well. Right? So we we didn't really do much to tell the agent not to do things but we just basically gave them a set of like um uh guiding principles.

So, what you get at the end after you've made all the selections, you actually get a full-fledged like inapp um uh preview of how your your feature is going to look like. And it's kind of interactive sometimes, depends on what the AI chooses. And then here, you can also make changes to the colors in in our case. And after that, we're going to generate a full like spec that will just send it over to the coding agent, right?

It's going to put it on our bot and then we're going to send it and then um we'll check back in maybe 20 minutes later to kind of get get that sorted out. Right. So now it's like picking the agent and shipping it off. Yeah. And a few moments later. So now it's done. And so now we can actually go into the preview part of our system.

Right. So that's really a big thing for me. Um it's able to look and click through the app. And this is basically once set up for any software factory out there. If they don't have a preview mode, you have to ask them, hey, you know, how how can my guys preview the app that I have created a PR for? Because that's super important because nobody hates a PR to review more than the one that doesn't work, right?

So like we leave it for the um all the whole team members able to preview the app before we send it over. So we have been using light sprint at light sprint and we've achieved a lot of success having super a lot of fun doing things in parallel and also kind of like doing things in p like on on local host. So we are mostly cloud agents. So if something is like a a mobile bug or a you know a small issue that that people talk to us about we'll put that on the board and then we'll fire off a cloud agent to do it.

Right. So lightrint will build cloud agent first right we we have plan we we think that people should plan with uh a plan and you can use our visual plan wizard right they should orchestrate and they should preview um and that's super important right so here are my socials and lip sprint like links so feel free to take a screenshot and use them and thank you so much for attending this Ah, woo. Thank you. Hang is such a cool product.

I was watching the demo back there and I was like, "Whoa, I I can be a full-on product manager now." Right. That's so cool. Thank you so much. You know what I've noticed over the past few talks? I've noticed a a consistent color scheme. Have you noticed this, too? Uh, right. It's it's all clawed. No. Anyway, um, no no no disrespect.

Everybody's got this orange thing going on, and I'm just like, whoa. It's it's a it's kind of interesting. It's a little bit derivative. Anyway, um our next talk I'm excited about this one because hey, can we get a round of applause for the organizers actually? They did such a good job. They did such a good job.

Really, really a Grim Sherry. Everybody did such a great job because you may not know this, but the talks are structured in such a way that they they lead into one another. Okay? And it's so cool. There's a natural order here. So the previous talk was about um the the the project management side and the next talk is also about that.

There's maybe a hot take here which um you know the Louis our next speaker will clarify but the hot take is in the future we will likely just plan and orchestrate agents that write and ship code. So so the job of writing and shipping code moves and we just become planners and orchestrators. Um, and and that's kind of the thing. Lou is going to tell us a story also about his previous business um that that tried to get traction but couldn't.

Uh, and and you know, I'll say this. W stands for win and L stands for lesson. And so he's going to have some lessons learned here. Uh, please, your biggest round of applause for Louie. All right. How we doing, Singapore? Woo. Let's go. Oh, it is 5:00 p.m. On a Sunday. Let's keep the energy high. Okay. Last thing standing between you

Planning and review as the human loop around parallel coding agents

Louis Knight-Webb, Co-Founder of Vibe Kanban (a kanban-board UI for spawning, parallelizing and reviewing Claude Code / Codex / 6+ other coding agents on per-ticket worktrees) and organizer of the AI Tinkerers community in London. He recently shut the startup down and is now talking publicly about what he learned.

Summary

  • Frames the thesis: as coding agents approach high accuracy, the developer job collapses to two things — planning and reviewing — and almost all the IDE/debugger/network-inspector software stack we use today is overhead for the code-writing part that's disappearing.
  • Origin story (May 2025): juggling multiple Claude Code instances across a wall of terminals made him imagine the UI for the post-babysitting era — a Jira-like board where each ticket has a play button, picks an agent (Codex, Claude Code, six others), runs in parallel, and yields a diff + live preview for human review.
  • Names the time-budget shift across the GitHub Copilot (2021) -> ChatGPT -> Cursor (2024) -> Claude Code arc: percentage of time looking at code shrinks, planning + review expand. Open question whether we 'get the writing-code time back' or just reallocate to managerial work — he answers: both, but mostly reallocate.
  • Operating heuristic: spend five extra minutes in plan mode (Codex/Claude Code plan mode, or a spec-driven framework with exhaustive interrogation) and you collapse review cycles from N revisions to one or two — 'push the slider toward planning whenever you can.'
  • Workflow taxonomy he flags as under-discussed: backend logic and migrations express cleanly in language and are well-suited to plan-heavy, multi-agent-parallel execution; front-end feature work has too much visual/interaction context to specify upfront, so it stays review-heavy and human-in-the-loop serial.
  • Tool-use loop time is exploding: Copilot (seconds) -> Cursor (~30s) -> Claude Code (~5 min) -> projected 20-30 min within a year as type-checkers, Playwright, more verifiers get folded in. Sitting and watching a terminal becomes untenable — parallel orchestration UIs become structurally necessary, not nice-to-have.
  • Post-mortem honesty: Vibe Kanban shipped a lot of primitives (per-ticket worktree, agent selector, structured review pane) that 'seem obvious now' but weren't in mid-2025; he still shut it down. Implicit lesson: pioneering an interface ahead of the accuracy curve doesn't guarantee a business — the moat shifts when the agents themselves shift.
  • Closing claim: the emerging job is managerial. Code-writing junior work evaporates; architecture, planning, review, coordination — historically tech-lead functions — become the whole job. The product opportunity is interfaces that maximize developer focus on plan + review and nothing else.

Slides (8)

Slide notes

  • [t=8:07:51] Slide 1: Opening image — wall of terminals showing Claude Code running across multiple panes (the May 2025 'my desktop' screenshot), Cursor sponsor logo bottom. The visual hook: parallel agents juggled by hand, no orchestration layer.
  • [t=8:08:30] Slide 2: Title card — 'What's gonna happen when coding agents get really accurate?' with speaker ID block: Louis Knight-Webb, Co-Founder, Vibe Kanban. Subtitle: 'All we're going to do is plan and review.' This is the talk's thesis in one slide.
  • [t=8:10:45] Slide 3: Vibe Kanban product screenshot — left pane shows a planned-approach checklist (component additions, prop updates, header behavior, regression tests) with 'Plan approved' state; center pane shows the Projects board ('Manage your projects and track their progress', 'My first project' rows); right pane shows the agent picker (Vibe Kanban Agents / Working Branch / Terminal). Demonstrates the three Vibe Kanban primitives: plan-artifact, ticket board, agent-router.
  • [t=8:12:30] Slide 4: 'Did we get back the time we used to spend writing code? ...or do we now spend more time planning and reviewing?' Below: a 2025/6 stacked time-bar showing Planning | Reviewing my code | Reviewing other people's code, with a tiny 'Writing code' sliver wedged between the planning and review blocks. The full visual argument for the talk's central claim. [Slides 5-8 dropped — speaker-cam frames, applause/handoff, and the start of the next speaker (Harsha, Interfaze) loading 'The AI Model We Built for Deterministic Developer Tasks.']

Full transcript (2111 words)

Louie. Uh, I until very recently was the co-founder of a startup called Vibe Camban. Um, I also run an AI community in London called AI tinkerers. So, if you're ever in London, come along to an event. You'll have fun. Um, and what I want to talk today uh about is why I was building this startup and why I shut it down.

How basically the job of software engineering is quickly becoming essentially plan and review code that is generated by AI. Uh and I guess I don't know how many people in the room are kind of interested who who is like a startup founder or is going to found a startup at some point in their lives probably. Okay, good. Well, I I will try and talk about like some of the reasons why we ended up shutting the company down and like maybe what could be gleamed and learned from that as well at the end.

Uh so very quickly I'll tell you about what it is we were building. So you got to go back to ancient history. It is May 2025 and my desktop starts looking a bit like this. I've got like loads of tabs open. Claw code has just dropped and I'm trying to juggle running multiple agents in parallel. And I started thinking this is kind of a completely new way of doing my job.

And what is going to happen when accuracy goes to 100% and I no longer actually need to babysit what the agent is doing. And I started imagining what that interface would look like. And essentially it's like all of the parts of software engineering without the codew writing part. Um and if you think about a lot of the software that we have idees, debuggers, uh UIs for like testing, network requests, things like that, most of the software that we use is actually for writing code.

And so if you eliminate that part of the job and you're just left with the planning part and the reviewing part, uh you could come up with a radically different UI for that. So we started building Vibe Canban and it's kind of in the name basically. It is a canban board where you create tickets kind of similar to how you would do in Jira. Um but the difference is you can click on any of those tickets click a play button and then you will have an option to run it in codeex in claude code or six other different agents.

And once something has finished running you then get a nice interface to review that work. So one of the ways is reviewing the code obviously. Uh, another way is testing something if it's a, you know, a website or an app or something like that. Um, so this is all ancient history. Seems really obvious now.

It wasn't very obvious in June 2025. And a lot of what we were doing at the time was kind of pioneering new ideas. There's a bunch of stuff that we shipped that we then deleted from the app that I'm not showing. So it took some experimentation to get here. So why did we do this? Well, it is because everything is becoming planning and review.

Um, if you think about how you might budget your time for the different tasks involved in software engineering before the GitHub co-pilot moment in 2021, most of our time was spent in an IDE kind of scrutinizing code, looking at code to some degree. And what's happened over time is that that has shrunk as a percentage of total work that we do. So you get the co-pilot moment and then you know suddenly autocomplete is completing a lot of code and then you get chatgpt and you're able to like paste in code and get you know another function out and paste it back in or you you no longer need to go to stack overflow. It's kind of you know making it a lot faster to iterate.

Then you get cursor in 2024 and it's almost like you're still looking at the code but you've got this kind of chat on the side and then eventually you get to kind of where we are today which is claw code where to be honest I think you know there's a lot of vibe coding going on. You almost don't even need to look at uh what's going on. Uh, and so I guess it kind of poses an interesting question of like do we get all of that time that we used to spend writing code back or has it just shifted work to other parts of the development process? I think the answer is probably a bit of both.

I think it has sped up uh the the overall job of software engineering, but at the same time I am now spending a hell of a lot time more planning and reviewing the work that I have to do. It depends. And so one of the ways in which and this is kind of like more of a practical way to think about how this framework of like planning and reviewing is useful is I think you can actually speed up your work with agents if you figure out how to get them to be really accurate. Uh and one of the ways to get the accuracy of coding agents up is to spend more time planning.

So what do I mean by that? I mean the most basic version of this is like the codeex or clawed code plan mode. So just use it. I use it for absolutely everything. Uh the kind of the complicated version of that is use a framework. So there's lots of great specdriven development frameworks out there which I'm sure there's been talks on.

Um and you can do this like interrogation method where you get it to ask you questions about the task that you're working on exhaustively until every possible question that you could have about a task has been answered. But the key thing is you're basically spending more time in planning before you ask an agent to do something. The consequence of that is that most times your agent will complete the work uh accurately and it'll need maybe one revision, two revisions. The other way, and this is something that I think we're all a little bit guilty of, is you don't spend a lot of time planning and you suffer the consequences of needing to do a lot of reviewing.

So, you know, how often do we just throw in a kind of loosely defined feature and, you know, complain when the model gives us something back that is halfbaked or is just completely missing the point. And so you you're more likely to go back and forth with the model more times if you spend less time planning. I think the other dimension to this is actually the type of work. And this isn't something I've really seen talked about too much.

It's kind of a halfbaked thought, but if you think about the types of engineering work, feature development is just radically different from migrations. And so these different workflows around spending a lot of time planning versus you know uh and and maybe if you're doing that you're able to run more than one agent at the same time versus a reviewheavy more human in the loop flow where you're not running things at the same time that probably favors uh more front-end work. So you know it's difficult sometimes actually to express all of the requirements for a complex front-end feature. There's a lot of interaction involved.

There's a lot of visual uh you know things that need to be communicated uh versus backend where you're describing logic and it's it's much easier to kind of find a common language I find when you're describing backend logic and therefore you know the planning and running multiple things in parallel tends to work a bit better for me in those situations. So, uh, TLDDR basically if you spend five minutes planning, you will probably save yourself a lot of time reviewing. And I recommend always, you know, pushing the the slider that way whenever you have the chance. Okay.

And then we can use history to kind of figure out where things are going. So, GitHub Copilot would run for a few seconds before giving you a result. Uh you know the original version of cursor 2024 would work for you know more than that 30 seconds before yielding a result and we're at claw code where it's kind of running for like five minutes before giving me a result on average and so the reason that's happening is because there's increased tool use so we've got agents giving you a response agents running a type checker then giving you a response agents running a type checker then using playright then giving you a response and you can extrapolate that like you know as more and more jobs are brought into the loop. Basically, the time that coding agents are taking is increasing.

And so, we're at this interesting point in the history of coding agents where we're about to go really quite far beyond what is comfortable to sit there and watch. Like, what are you going to do when a coding agent runs for 20 minutes? You're not going to sit there and watch your terminal, you know, like twiddling your thumbs. I mean, you might procrastinate and end up on Twitter or something like that, but I I don't think that's a good use of of of my time, and it gets boring very quickly.

So, you know, if I had to predict, I would say a year from now, you know, we're probably looking at, you know, these things running for half an hour, and we'll need to find ways to paralyze this uh a lot more. Um, okay. And I think I'm almost out of time, so I'm going to wrap up with some quick observations. I think basically the work that is emerging is is is managerial.

So if your job on a team of software engineers was to write a lot of code and not do a lot of review and not do a lot of architecture and all the other things that you know maybe would be associated with like more senior or tech lead roles. All of the other stuff is basically going away the code writing part and what will remain are all the kind of traditionally managerial functions. Um, and yeah, I mean, we should be building experiences and interfaces that maximize the focus of the developer. So, things that keep them focused on what is important, eg planning and review.

Right, I'm going to have to wrap there because I'm out of time, but thank you very much and it's great to be here. Thanks, Singapore. &gt;&gt; Keep it going for Louie, everybody. &gt;&gt; That was an incredible talk. I am a I am a manager now. We'll give it to the next speaker. I'm a manager now. Hey, how we doing?

How we feeling in here? What? Why are you even here, man? Go sleep or Anyway, he wants to do something. &gt;&gt; All right, so let's play a little game. Can you guess what our next speaker Wait, no. &gt;&gt; What? &gt;&gt; I didn't mean what, bro. My &gt;&gt; Can you guess? &gt;&gt; No. Can you guess where our next speaker flew in from?

Uh, shout out your answer by the way. Your options are Singapore. I mean like he stayed in Singapore of course and then Sri Lanka or yet again San Francisco. Shout out your answer. Come on. &gt;&gt; SF. She said it. Where is it? &gt;&gt; San Francisco. Hey, we're on a San Francisco train. Yeah. Give it up for San Francisco, everybody. &gt;&gt; Way too many people from San Francisco. &gt;&gt; That's that's where it happens, brother.

AI engineers. &gt;&gt; That's where dreams are made. &gt;&gt; Yeah. Yeah. What a great quiz. Thank you. Thank you so much. Give it up for your coc everybody. Usman, our next talk &gt;&gt; comes to us from Harsha who works at Interphase. It's an AI research lab and he's going to talk to us about how they train specialized coding models with a new architecture beyond the transformer.

So your biggest round of applause for Harsha. &gt;&gt; Thank you. Thank you. Great intro by the way.

Interfaze AI: specialized encoders + LLM decoders for deterministic developer tasks

Harsha Khurdula, Co-Founder & CTO, Interfaze AI (YC-backed research lab reinventing transformers for deterministic OCR, object detection, ASR — claims #1 on M-OCR-Bench vs Chandra OCR and Reducto)

Summary

  • Core thesis: context engineering has limits — even GPT-5 / Claude / Gemini hallucinate on deterministic dev tasks (OCR, detection, ASR, translation) because of context drift on large inputs. Architecture, not data, is the bottleneck.
  • Architectural move: bring back the rigidity of classical ML as task-specific encoders (DNN/CNN/transformer stacks) and use an LLM only as the conditioned decoder. Encoder gives bounding boxes, confidence scores, and metadata you can trust; decoder turns that into structured output (e.g., age computed from extracted DOB).
  • Live demo 1 — Virginia commercial driver's license: Interfaze extracted all text WITH pixel-level bounding boxes, detected both faces (main photo + specimen ghost photo), and correctly computed age from DOB. Reducto extracted text but failed both bounding boxes and the age inference.
  • Live demo 2 — LinkedIn scrape of Garry Tan: bypassed LinkedIn's bot/blocker stack via an in-house scrape model, returned full work history as JSON all the way back to internship — beyond the 'see more' button.
  • Live demo 3 — dense PDF translated to Hindi with character count: Interfaze preserved addresses and author names untouched (correct safety/relevance behavior) while Claude 4.7 Opus failed all three retries on timeout; broader point on long-horizon multilingual failure on South-Asian languages.
  • OCR pipeline: CNN-stack encoder detects text regions → crop image at each region → feed crops to decoder → optional second decoder conditioned on extracted fields for structured output (where the 'compute the age' step lives).
  • Open-vocabulary object detection: text encoder + image encoder + contrastive scoring head + oriented box head. Unlike YOLO (closed-class), takes a natural-language prompt ('detect everyone wearing a black t-shirt'). Pixel-precise segmentation adds a mask decoder producing per-instance masks with confidence.
  • ASR stack: VAD-style speech localization → crop audio chunks → acoustic-feature encoder for embeddings → clustering for speaker diarization; separately, audio → spectrogram → frame-by-frame classification for transcription. Claims fastest latency and lowest WER among compared models.
  • Unifying design: 'Same Latent Space, Task-Specific Adapters' — three modalities (vision, audio, text) trained as adapters into a shared decoder, so multimodality is solved with provenance (you know which encoder produced which span).
  • Positioning: explicitly benchmarks not against frontier LLMs but against the 'economical, one-shot, production' tier — task-specific models that are cheap, low-latency, and deterministic. YC company (S-batch backed).

Slides (33)

Slide notes

  • [t=8:19:42 / slide_01] Title plate: 'Context Engineering has Limits' over the Databricks long-context-RAG chart — average answer correctness vs context length across gpt-4o, claude-3-5-sonnet, claude-3-opus, claude-3-haiku, gpt-4-turbo, gpt-4, meta-llama-3.1-405b, meta-llama-3-70b, mixtral-8x7b, dbrx, gpt-3.5-turbo. Citation: databricks.com/blog/long-context-rag-performance-llms. This is the lede slide — 'frontier models degrade as context grows, therefore deterministic tasks need a different architecture.' [slide_03] 'Task specific DNNs + CNNs + Transformers' — architecture diagram showing prompt → pre-processor → task-specific encoder stack (DNN/CNN/transformer branches) → routing → LLM decoder → structured output. This is the central architectural claim of the talk. [slide_05 / slide_08] Live demo of Virginia commercial driver's license — input image used for OCR + face detection + age verification. [slide_10] Interfaze output: bounding boxes drawn on text fields (name, DOB, address, ID number, expiry) AND two face-detection boxes (main photo + small specimen ghost photo). Age field computed from DOB. [slide_12] Side-by-side: Interfaze (left) vs Reducto (right) on the same license. Reducto gets text but is missing bounding boxes for several fields and the secondary face — concrete win for the encoder-first architecture. [slide_13] LinkedIn scrape demo output for Garry Tan — JSON with first_name, last_name, location (San Francisco, California, United States), latest_education (Stanford University), current_job (President & CEO at Y Combinator), followers (315,936), and full experience array including Garry's List (Founder, Jan 2020-Present), Y Combinator (President & CEO, Jan 2023-Present), Initialized Capital (Board Partner & Advisor, Jan 2023-Present). [slide_15] Dense PDF demo — the Interfaze paper itself, 'Interfaze: The Future of AI is built on Task-Specific Small Models' (arXiv:2402.XXXX style header visible) being used as the translation source. [slide_19] 'Open-Vocabulary Object Detection' pipeline: (1) input image → (2) text queries (laptop, coffee mug, potted plant, smartphone, notebook, pen) → (3) encode image and text → (4) contrastive scoring head computes similarity between image regions and text queries → (5) oriented box head + output with per-object confidence scores. The non-YOLO claim — detection driven by natural-language prompt. [slide_20] 'Pixel-Precise Segmentation' pipeline: detection boxes from step 1 → image encoder → prompt encoder → mask decoder → instance mask head, producing per-instance masks with confidence (laptop 0.96, coffee mug 0.92, potted plant 0.97, smartphone 0.95, notebook 0.93, pen 0.91). [slide_23] Translation demo result side-by-side: source English PDF on left, Interfaze Hindi translation in middle, Claude 4.7 Opus failure on right ('Failed after 3 attempts. Last error: Cannot connect to API: getaddrinfo ENOTFOUND ... maxRetriesExceeded'). Visual proof point for the long-horizon multilingual claim. [slide_25] 'Same Latent Space, Task-Specific Adapters' — the unifying frame. Vision, audio, text encoders all feed a shared decoder via task-specific adapters. [slide_30] Closing personal slide: Harsha + co-founder (Yoan) in front of the Y Combinator sign — YC-backed signal for cap-table / watchlist purposes. [slide_33] Final slide: Interfaze logo + two QR codes (X: @khurdula, email: harsha@interfaze.ai) + 'Thank You!' — contact capture for outreach.

Full transcript (1947 words)

Good evening everyone. My name is Harsha. I'm the co-founder and CTO at interface. We are a research lab that is reinventing transformers. Today I want to talk about how we managed to build a new architecture for deterministic developer tasks. Now it is no mystery that in the past two decades AI has gone from being a rigid machine learning model to a larger scale generalizable uh intelligence which you can use today for AI workflows.

We've gone from building uh structured fine-tuned models to today prompting that allows you to build agents. More specifically, think about this early 2010s to 2015s. You're a bank. You want to do OCR. How would you go about it? You would have to purchase or procure large data sets. Not only that, get a talented team to build that model, deploy and then maintain it.

This could easily cost you about a few hundred to even millions of dollars. Thanks to invention of large language models, we are able to do that with prompts. However, there's still a problem. Problem of hallucination though models like GPT are now massively multimodel and we are seeing it with Gemini they still hallucinate this happens because of context drift when you want it to be behave deterministically for large inputs of data you see hallucinations happen we at interface are solving this exact problem by designing a new architecture that we train so we bring the uh rigidity of a large language sorry machine learning model and the flexibility of a large language model.

So how do we do go about this? You use machine learning models as strong encoders for very specific tasks and then you use large language models to create the decoding phase of it. Today I want to showcase a few things as to what this model can do. I want to quickly showcase three things. I'm going to talk about it.

I'm just going to quickly run it so that we have time to talk about it. So first thing this is a real document. I want to extract data from it. Not only the text but I also want to detect the faces on it and also calculate his age to verify it. So we run interface for this. This is what interface gave us.

Not only did it extract the text, it gave you the bounding boxes of where it saw the text in the image, the actual pixel coordinates. It got both the faces right. And more importantly, it managed to calculate the age correctly. That's right. Now, let me show you one of the specific model providers or OCR providers that also does OCR.

That is Redu. A lot of you might have heard about it. Reductor did extract the text correctly, but it failed to do the other parts of it. Detect where the text is and calculate the age. Now, this happens because of a stronger encoder. Let's go to the next one. We want to scrape this particular LinkedIn page.

Surprised that Gary doesn't follow me yet, but okay. So, we want to extract Gary's experience. LinkedIn can be a pain to scrape because of the blockers and bot checks they have. I want to extract his experience beyond this button. Now, that's going to be interesting. So, let's see what interface did. Not only did it give us what it saw on the first first page, but it goes all the way back to his internship.

We are able to do this because of our own script model which is able to scrape uh LinkedIn. And lastly, I want to go about uh a PDF, a dense PDF. Sorry. Uh so I just have to run this again. So on this screen you're seeing a dense PDF that is supposedly a research paper for this particular model. We want to extract this entire text and translate it to Hindi and also count the number of characters in this PDF.

As it runs, I want to go back to the presentation cuz that's going to take time and then talk about it. So now that we saw the demo as to interface can do, I want to talk about how we managed to do it. I want to talk about what we actually trained. How did we do OCR? Before that, I want to showcase where we stand as well.

On your screen. This is M OCR bench which tells you how good a model is at handling complex documents not only from research papers but also complex handwriting for massively multilingual uh OCR. We are number one when compared to even specialized models like Chundra OCR or even specific providers like Redu. This is the example that you saw and this is the output that you saw.

What is happening under the hood is that this image is fed as input to the encoder that we trained which is at CNN stack that tells you the text regions. Each of these text regions becomes the uh becomes the location of crop. So you crop the image from where the text is use that to give it to the decoder to gen generate the output. Now this gives you confidence scores.

This gives you bounding boxes and metadata that you can actually trust beyond just simple text. We can go a step forward and feed that information to a larger model which is a decoder that we also have conditioned upon to get extract structured output. That's where the age aspect came from. You get the information and then you condition on top of it.

That is OCR. Now I go to object detection. How did you manage to detect faces? Now that is object detection with natural language. YOLO models are great but they only detect specific objects that they are trained for. We are number one for natural language object detection. Meaning you pass in a prompt. Let's take this room.

I give a picture of what I'm seeing in front of me and I say detect everybody who's wearing a black t-shirt. Interface would be able to do this. That's a complicated thing to do. How are we able to do that? So you take the same image, you have a text encoder which is encoding the text aspect, understanding what the user wants.

You have an image encoder which is understanding or representing the image in positional aspects and then creating a contrastive segmentation meaning it is pulling pixels which are closer to each other allowing you to detect the objects accurately. If you use that information a step further you can now segment those pixels. Same thing image encoder, prompt encoder, and then you have a mask decoder that will classify all pixels to give you a latent mask. ASR multimodality is a huge thing.

Not a lot of models support speech out of the box and I want to talk about it today. We we are one of the fastest models when it comes to ASR and we also have the lowest VR per error rate. So how do we do it? So when you give alarm form audio, we first detect wherever the speech is happening and then crop those audio clips.

So we get the chunks and then use those chunks to extract acoustic features for an encoder which is also trained to extract embeddings for feature features. Now these embeddings are used to cluster. Clustering allows us to segment uh segment features into groups and that gives us dization output. So now you know which audio is by which speaker but the text comes from the encoding part again where you convert the audio into a spectogram.

A specttogram is basically a visual representation of audio and then you use that as a frame to generate or classify text. So whatever the pronunciation is that would particularly be classified into text. So before going to the next thing let's see what interface gave for translation. So for this you can see that interface not only managed to extract all the text and translate it to Hindi but it also stayed relevant and safe where it's not supposed to.

Like it did not translate addresses, it did not translate author names and it also correctly calculated number of characters. Now we put this against claude 4.7 opus to see what claude would do. We gave it three tries and that's why I went back to this. Claude failed all three times. It did because of a timeout.

But even if it were able to and if it is a long horizon task, there's a problem with multilinguality, especially with South Asian languages. Let's go back. So we saw three things vision, audio, and text. While working with these three encoders, we train these adapters to work with the same decoder. So you would get accurate data, but you know where that data is being extracted from.

You could solve multimodality this way. Today I'm super excited to showcase our numbers as to three modalities that I was just talking about. We compare these we compared interface to models that you would traditionally use in production and these models are economical and can do the tasks in one shot. But we are comparing them for deterministic tasks tasks where there is only one output.

If you're looking at an image, my name cannot magically change. It's going to be still hersa. Yoan and I and my team have been researching about how do we build uh task specific models for about a year now. We did the same things. We picked small language models. We procured large data sets for a lot of money and we kept running into the same problem of determinism.

Models hallucinate. That's where we thought we have to go back to the board, redesign the architecture and rethink it. We observed that data is not the bottleneck. The architecture is and that is what interface is supposed to solve. Lastly, it's been such a pleasure to speak in such a amazing audience and such a beautiful country.

Thank you interface. Honestly, those benchmarks were so impressive. Thank you. That was incredible. Um, what fantastic benchmarks. What's up, Usman? &gt;&gt; Hello. &gt;&gt; How's it going? &gt;&gt; Great. &gt;&gt; How's it going, everybody? &gt;&gt; You know, I swear to God, you as an audience, you make me feel like Michael Scott.

You watch the office. You know what I mean? I'm just here. Am I entertaining you? And you're like, "No, I'm ready to go home." Don't be ready to go home. It's not time yet. Okay. I need you to like be inspired. Are you inspired? &gt;&gt; That's better. That's better. Usman, what's up next? &gt;&gt; Well, now we have um some guy named Harishi.

Fun fact, &gt;&gt; this is so awesome. This time he's actually based in Singapore and &gt;&gt; Singapore tech. &gt;&gt; We love Singapore. &gt;&gt; We love Singapore. &gt;&gt; Yeah, &gt;&gt; that's it. Energy over. &gt;&gt; Um, &gt;&gt; continue, please. &gt;&gt; Okay. Um, he his uh his app is actually based on how his personal mistakes with AI and especially coding.

I'm pretty sure all of my vibe coders here can relate as to how much mistakes or errors or bugs we've all come through and push. &gt;&gt; Look at this wallpaper too by the way. &gt;&gt; Oh wow. &gt;&gt; Right. That wallpaper is how you know it's going to be a banger. It's so cool. Are you ready Hi, &gt;&gt; you are good to go.

Hey everyone, again your biggest round of applause for Hish.

No Country for Old Code: high-context agent runtimes, declarative budgets, and the discipline of building agents that get to become legacy

Hrishi Olickel, CEO of Southbridge AI (previously co-founder/CTO of Greptile; widely-read AI engineering blogger known for the 'how to leave Greenfield' framing) — talk derived from his AIE unconference session, billing Hancho/Sentinels as Southbridge's open-source high-context agent runtime.

Summary

  • Frames the problem as 'No Country for Old Code' — most real agent work isn't greenfield coding agents but agents inside large existing systems with legacy code, organizations, and data; the operating posture is repair-over-rebuild, update-over-create, old-over-new.
  • Names five primitives that fall out of those priors: (1) prefer simple reusable units of work over one-shotting context windows, (2) remove things from context instead of adding, (3) separate control flow from prompts and prompts from code, (4) calibrate for behavior not stepwise success/failure, (5) build cost-aware systems that separate build-time and runtime so resources percolate.
  • Southbridge thesis: GPT-3.5-turbo was the last unit of general intelligence needed; everything since has been data-ingest connectors that self-heal, plus ETL for healthcare/finance/energy — 'we're beginning to solve ingest as a horizontal category.' Data-first agent work starts on the critical path with reliability as a day-one baseline.
  • Distinguishes online (human-in-front) vs offline (no human) agentic systems; argues most real systems are mostly offline, and 'agents can function like appliances' — fill the dishwasher at night, run thousands of reliable repetitions on cheap local models overnight.
  • Calls the coding-agent loop the future base substrate for agentic work — not because all agent work is coding, but because it concentrates RL, deployment pressure, and universal primitives (read/write/edit/shell). Analogy: V8 became the substrate for software that wasn't websites; coding-agent harnesses will become the engine layer for agentic work.
  • Hancho is Southbridge's open-source runtime built on these priors. Atomic units of work are called 'codons' — small, composable, reusable; chain them to get behavior. The bottleneck on long runs is the human's ability to reason about what happens at hour 20 or 25, so codon decomposition is what makes long-horizon agents tractable.
  • Sentinels are the runtime's most powerful primitive: LLM calls that trigger on event combinations from the primary loop, template their context, write a result to a file. They wake every N tool calls, summarize, sleep — far better than hooks for coalescing behavior (catching laziness, mocking, bad data hygiene, file rights, shell errors). Define the pattern once; fix it in the main thread.
  • Single primary agentic thread; explicitly no parallel agents — reliability gains outweigh parallelism. Sentinels are the event-loop hack that replaces multi-agent fanout.
  • Declarative budgets: long-horizon systems must be cost-aware on every axis (money, tokens, time, data access). Express resource distribution at build time; runtime knows what's available and the system solves for the gap. Declarative wins because models/harnesses/details change weekly and you avoid rewrites — same logic that made SQL durable.
  • Northstar reframe: stop shipping tools, start shipping outcomes. Agents need to become infrastructure — boring, repeatable, predictable, the kind that 'gets to become legacy.' Hrishi: 'it's only in code that legacy is a bad word' — and reading the millions of words of AI-generated output yourselves is Southbridge's actual superpower (9 of 10 failures = wrong shared abstraction or context that should've been deleted).

Slides (10)

Slide notes

  • [t=8:31:07 / 30667s] Slide 1 — Title: 'No Country for Old Code' (subtitle: 'or orgs or data'). Tagline: 'About building reliable systems that outlast us.' URL: southbridge.ai. Speaker: Hrishi Olickel, CEO Southbridge. Lower-third caption: 'No Country For Old Code: An open-source runtime for high-context agents.' [t~8:38:00 / ~31080s] Slide 2 — Hancho primary thread visualization with chronological event log overlaid on a rocket-engine illustration. PRIMARY THREAD (chronological) shows interleaved entries: 'assistant_action plan ext...', 'tool_result grep parser', 'file_write parser/file.ts', 'assistant_action fix tigh... wide case', 'tool_result edit:file parser/...', 'file_write to.ts', 'codon_boundary dev_loop / iter_0', '---', 'assistant_action check what we just changed', 'file_read .hancho_eve/sentinels/curator.md', 'assistant_action continue with type fixes'. QR code linked to hancho.southbridge.ai/sentinels. This is the slide where Hrishi introduces sentinels as event-triggered LLM calls writing summaries to file. [t~8:42:00 / ~31320s] Slide 3 — Closing slide. Headline: 'build things that get to become legacy.' Two QR codes: left = 'no country for old code' linking southbridge.ai/blog/no-country-for-old-code; right = '@hrishioa' linking x.com/hrishioa. This is the kicker after 'it's only in code that legacy is a bad word.' [t~8:43:00–8:44:30] Slides 4–7 — speaker-cam frames over the Singapore skyline backdrop during Q&A/closing thanks and walk-off; no slide content. Skipped. [t~8:44:30–8:45:30] Slides 8–10 — handover to next speaker (MC banter about 'MCP vs CLI', Monday.com MCP anecdote, display-extend tech-support comedy). Slide 10 shows the 'CALCLAVIA' username/handle and the AIE logo as Henry Mao (smithery.ai) is being introduced for the next talk. Not Hrishi content.

Full transcript (2586 words)

So that actually was a custom version of Bliss that we made from a talk that I gave at the unconference called how to leave Greenfield. So if you don't know Bliss, at least you know Greenfield. So this is welcome to no country for all code, right? And it's a working title. I think everyone keeps changing titles all the time.

So it's not a talk about coding agents. It isn't a talk about agents for coding. It's a talk about building agents inside large existing systems, right? With old code, organizations, and data because that's what we end up doing, which is about repairing over rebuilding, update over create, about old code and organizations over new.

And it turns out if you start from those base priors, a bunch of different primitives fall out, right? You prefer simpler reusable units of work instead of trying to oneshot context windows, right? You remove things from context instead of adding things. And you separate control flow from prompts and prompts from code.

And you calibrate for behavior instead of step-wise success and failure. And you build cost aware systems that separate build and runtime so you can percolate resources effectively. And turns out if you do all of those things well, you get to ship outcomes and you get to do things once and have them stay done. You get to fix things and that break and have them stay fixed.

And you get to vibe when you want to, right? Which makes it so much more fun. So that's really that's the bulk of the talk. I'm just going to spend some time explaining that but if that's good uh we can go right into it. So before this I spent a few years in electronics and software and where the bottleneck was like always data and it was getting it into a shape where it can be useful for a decision and after sort of thinking about it for a decade I started Southbridge with that conviction that 3.5 turbo was this unlock right that last unit of general intelligence that we needed and we could build the rest since then we've built connectors for data systems that self-heal regenerate we've built ETL systems for healthcare financial energy We're beginning to solve ingest I think as a species but also as a company.

Ingest as a horizontal category right whether that's for new customers, new data sets or even user uploaded data. And everything that we've done since we got founded was in service of solving that first mile problem for data with AI. But the problem with starting with data though is like your difficulty from day one is turned up to 11, right? Because you start in the critical path and your work needs to be long horizon from day one and like reliable as a baseline.

A full run on even small data which is like a gigabyte right verifying formats, validation, resolving entities all take like millions of operations and that those errors stack up. Context windows if you remember Gemini going from 2 mil to 1 mil actually started going backwards right but even if they went like a 100x we would still have way more data in like a day than what you can process. But then again, the most important sort of biggest killer of data companies that I've seen is diversity, right? Data as a stack as a whole is very very diverse.

Both in the macro and also the micro, right? In the micro, humans, us as a species, turn everything in we can into a canvas. Documents, Excel sheets, PDFs, like the like internally we have the joke that the the merge cell button in Excel was one of the greatest crimes against humanity. And in the macro though, companies really are unique, you know, TM snowflakes because you've got different stacks, programs, SOPs, security boundaries.

Even the same database like little Postgress viewed through different internets and permission systems looks like completely different systems. But one important separation that I want to make here is between online and offline agentic systems, right? And it's a it's a useful way to think about these things. Um like online versus offline is things with a human in front of them and things without right and as much as like I want to we want to stroke our egos most real systems have far larger offline components and online ones especially all the ones that we've worked on right you only really need an active latency sensitive human in the loop if you build things fresh every single time like if you can build reliable systems that oify over time and record your preferences like all of that work can move offline to run overnight on local models for cheaper and that agents can function like appliances.

They can do a job repeatably thousands of times, right? You fill your dishwasher at night before you go to bed. The next one is that I we still believe that coding agents are going to become the base substrate for agentic work, right? Not because all agentic work is coding, right? In fact, I think we'll saturate on coding very very soon.

But because the coding agent loop is becoming the thing with the most amount of resources, the most RL, the most deployment pressure, and it's got universal primitives, read, write, edit, shell, right? And in the same way that V8 and browsers became the substrate for a crazy amount of software that wasn't actually websites, coding agent harnesses, we believe, will become the engine layer for a lot of agentic work. Okay, so that's enough about the the general structure of things. What did we actually learn?

Right, the first thing was to stop pushing one shot, right? Single shot performance I think can be crazy fun to push when you're building things and like same here like you know complex instructions, long plans, giant skills. I think Sabina was talking about fries and more fries in the back compaction. But repeatable work, which is where we said runs counter to all of those instincts, right?

It just is not how you want to build. If you want self-driving agents, right? Because the first thing you want to do is break things into small atomic pieces and in Hankqu which is this runtime that we use and we've used for a long time and sort of recently open sourced those small little boxes are called codons, right? You chain those to then get the behavior that you want and you make them reusable and composable.

And if you break it down this way, it makes it so much easier to reason about long runs, which ends up becoming the bottleneck. Your ability to reason about what happens at hour 20 or hour 25 like you the human ends up becoming the bottleneck to you building complex software right the next thing is to remove things from context right I'm still surprised at how few harnesses systems uh just frameworks out there have a way to remove things from context right like the default behavior we've always had is have boundaries that delete context and archive what you don't need right preventing this thing that internally we've come to call world line rot which is you know Ted Lasso says be goldfish ends up being a good thing. The next one there is to just separate separate components by type. Like as an industry, we keep having to relearn this, right?

Back when I was in college, we had van Harvard architectures with like code and data separations becoming a thing. And then later on we had PHP and like CGI and it took us another four years to learn that you had to separate model view and controller. And agentically, same story, right? You if you want to build reliable systems you want to keep these five things as separate as possible like data promps control and the rest right and in the last year we've worked with a lot of people we've touched a massive amount of information we've read millions of words of AI generated results like I I like I said that's our you know superpower which is that we read the outputs we read the outputs for you and we read everything that comes out of these things nine times out of 10 if something breaks it's because there was a wrong abstraction shared shared between you and the agent or because something was left in context that just did not need to be there.

So going into a bit more of our things, right? Like we usually build on the principle that the best part is no part, right? So simple tools sequence work like we talked about and you only add things if you absolutely have to. So I hope it's not a surprise when I say that we've never actually needed parallel agents, right?

A single primary agentic thread for us in the line of work that we do for reliability has way too many benefits to give up, right? So many programming languages, Python, JavaScript, a lot of them will agree and we'll look at some of the benefits on our side in a second. But for our version of the event loop, that little hack is what we call sentinels. So we initially designed those things to monitor long agentic runs, but they've become our most powerful primitive.

So sentinels are LLM calls that trigger on some combination of events from the primary loop, right? They trigger, template their context, and then write the result to a file. A sentinel could wake up every 50 tool calls, summarize what happened, and then go back to sleep, right? But turns out they're amazing at catching behavior without creating so much complexity that you have to troubleshoot the the eval system.

So laziness, mocking, bad data hygiene, file rights, shell errors. You define the pattern that you want in something reusable that we call the sentinel and then you fix it in the main thread. Right? Way more than hooks. This is far far better for us for coalesing behavior. So I'll do one more just one more which is budgets.

Right? Long horizon systems on our side just need to be cost aware on every axis that matters right but if you do everything that I said so far you can make a declarative budgeting system which is really the best kind like SQL. You can express what you have and the system figures out the gap in between. Right?

In fastmoving spaces like AI where models, harnesses, implementation details change all the time, declarative actually wins because it keeps you from needing to rewrite things. So we've got all of the different axes, money, tokens, time, data access even in the right time. You express at build time how those should be distributed. And at runtime, you actually know what resources you have.

So you can solve for the two things, right? So finally, if you do all of these things, you can ship outcomes instead of building tools, right? And I say this to a room full of people, me included, who care a lot about the craft, who care about the tooling, right? But most people, they don't care how their dishwasher works.

They don't care how their car injects fuel. They want clean dishes. They want to get where they're going. Like, so our northstar has always been to deploy systems that ship the outcome, right? Which might be getting a customer on boarded as quickly as possible, validating research hypothesis, cutting integration time, right?

Or just doing all of that without embedding what we call Achilles into your data. And for that agents need to become infrastructure. They need to become boring, repeatable, predictable. And so that is really just the goal for us, right? To build things that get to become legacy. It's only in code that really legacy is a bad word.

So in some ways you're trying to bring that back. So lots of things that couldn't go into the talk, but uh you can go here for for the long version. Thank you guys. Woo! Ah, Hershi, thank you so much. That was such a great talk. You know, I got to talk to Hishi backstage and I was already prepared. Wow, what an incredible talk.

One more round of applause for Hishi, everybody. Oh my goodness. Incredible. Our next talk is is is another exciting one. I went backstage and asked him, I said, "Hey, what's your talk about?" And he said three words. He literally just said three words and nothing more. No more words were spoken, Henry. Um, the words were MCP versus CLI.

That's that's the talk. And I'm really excited about how many of you um use MCP on the daily. Almost everybody. Wow. What do you use it for? You over there with the glasses. What do you use it for? Debugging production. Awesome. That's actually a good use case. We um internally where I work, we use a project management tool called Monday.

Anyone use Monday here? Monday monday.com. Um it's I'm not going to say anything. Anyway, um they they have a UI like a web UI, but they also have an MCP server, which is so amazing because I can be working on something in in cursor, my preferred IDE, not Spawn. Um and and I have the Monday MCP server inside and I can just say I'm going to this conference added to Monday in the agent and it just does that and it's so cool.

So I'm a huge fan of Team MCP. Um but of course CLI also have reason to exist. I mean, Claude Code um is a is a CLI agent, a coding agent with an MCP client functionality, right? And so, how does this land? Well, we're going to find out. Henry's just setting up here, and in a minute, we're going to hear about MCP versus CLI, which may not even be a versus.

It could be an MCP and end CLI. Um, do you think CLI is kind of going out of style? Anyone? No. Yeah, of course not. Because if we don't use it, agents will use it. I think it's a fantastic user interface. I'm slowly running out of things to say. Oh, good. Look at that. Hey, listen. We're almost at the end of the conference.

This is going to be a great talk. Give your biggest round of applause for Henry &gt;&gt; Mau. &gt;&gt; No, &gt;&gt; we have a bit more. &gt;&gt; That's okay. Oh, he's You got to You got to extend. Choose extend display. I'm tech support now. There we Is it Is it ready? No. Almost. Okay. No, see what they're doing is they're extending, but he hasn't dragged the window.

This is now commentary, everybody. That's what I love it. Thank you. Oh, pity. You know, this is this is the You know what you call this? You call this pity applause. Thank you. I need it. Put a coin in my hat, too, while you're at it. There we go. It was extended. They dragged it. Okay, let's try this again. Your biggest round of applause, Henry Mao. &gt;&gt; Thanks for the introduction.

Henry Mao (Smithery): MCP, CLIs, and the harness era of agent agency

Henry Mao, Smithery (founder — MCP server registry / aggregator)

Summary

  • Henry Mao (co-founder/CEO Smithery, previously co-founder of Jenni AI, $10M ARR) frames the entire MCP-vs-CLI debate around one operator-grade thesis: 'prompting is the tax you pay for missing agency' — humans are still acting as the adapter layer between models and services.
  • Smithery is the open MCP registry/gateway: thousands of developer-published servers, unified auth, gateway aggregation, currently processing ~100,000 tool calls/day. The growth chart (June 2025 → March 2026) goes near-vertical.
  • Honest postmortem on 'MCP is dead?' (RIP 2024-2025 gravestone slide): spec churn through 2025 (initial spec → streamable HTTP+OAuth → elicitation+structured tools → async tasks+sampling-with-tools), poor client/server implementations, and harnesses that 'dumped every tool into context' — the Chrome-without-rendering analogy.
  • Two failure modes diagnosed: (1) BAD HARNESSES — context rot from dumping all tools into the window; (2) BAD SERVERS — watered-down APIs with hand-crafted prompt-injection tool descriptions ('search issues. IMPORTANT: always call this before create_issue. Be CONCISE') — antipattern that conflates task-specific behavior (which belongs in a SKILL) with the tool description.
  • MCP vs CLI is a category error: MCP is a PROTOCOL (like REST/GraphQL), CLI is an INTERFACE. The real question is whether harnesses do native MCP rendering — which as of early 2026, Claude and Codex finally do.
  • Benchmark (Smithery): three APIs (GitHub, Linear, Singapore Bus API — chosen for diversity + training-data contamination control), three models, MCP vs CLI. Native MCP won on BOTH accuracy and token efficiency — busting the 2025 consensus.
  • Ablation found the two principles that matter for agent experience: (1) self-documentation (discoverable, well-described tools) and (2) search (let the agent grep its own toolset). Apply both to a CLI and you close most of the gap.
  • MCP's three remaining wins: portability (works in cloud agents without sandboxes — lower latency for non-coding tasks); harness-managed context engineering (your tools improve when Claude Code improves); permissioning — CLIs 'give the keys to the kingdom,' MCP defines a small opinionated surface where policy DSLs can scope authority ('let this agent read Gmail and Notion for 30 min, no send, no CRM write').
  • The 'two terrible choices' frame: ask-for-approval (doesn't scale) vs --dangerously-skip-permissions. MCP gateways with policy enforcement (Smithery offers this) are the third option as agents graduate from chatbot to coworker.
  • Closing thesis: protocols should become boring like HTTP. The destination is 'human in the loop → human ON the loop' — outcomes-driven agents, not prompt-driven ones.

Slides (24)

Slide notes

  • [t=8:44:30 / start] Slide 1 — Title slide: 'Give agents more agency' (Smithery red). Henry Mao, Co-founder/CEO, Smithery. Previously co-founder of Jenni AI ($10M ARR).
  • [t=~8:46] Slide 2 — 'Prompting is THE TAX you pay for missing AGENCY.' (yellow accent on 'the tax' and 'agency'). The thesis-card.
  • [t=~8:47] Slide 3 — 'Smithery: unified context and action for agents.' Line chart titled SERVERS PUBLISHED ON SMITHERY, June 2025 → March 2026, near-vertical growth in the last 3 months.
  • [t=~8:48] Slide 4 — Smithery UI screenshot showing MCPs / Skills / Docs / Pricing / Publish / Toolbox tabs; visible entries include a Gmail MCP (20,200 users), a code-docs MCP, a 'Search' MCP citing Brave's index, all with 'Add to toolbox' buttons.
  • [t=~8:49] Slide 5 — 'MCP moved fast (and there was chaos!)' Four spec milestones: 2024-11-05 Initial spec (tools, resources, prompts, sampling); 2025-03-26 Streamable HTTP + OAuth (replaces SSE, tool annotations, auth flow); 2025-08-18 Elicitation + structured tools (OAuth resource servers, resource links, draft batching); 2025-11-25 Async tasks + sampling-with-tools (OIDC discovery, incremental scope consent, icons).
  • [t=~8:50] Slide 6 — '"MCP is dead?"' with a cartoon gravestone. Subtitle: 'RIP: 2024 – 2025.'
  • [t=~8:51] Slide 7 — 'Bad harnesses — context rot.' Two stacked context-window diagrams: 'Without MCP Tools' (clean) vs 'Too Many MCP Tools' (overflowing into a 'Death Zone' / 'Toxic Zone'). Bottom: a t.co link to a referenced status.
  • [t=~8:52] Slide 8 — Garry Tan tweet (@garrytan, Mar 12 2026): 'MCP sucks honestly. It eats too much context window and you have to toggle it on and off and the auth sucks. I got sick of Claude in Chrome via MCP and vibe coded a CLI wrapper for Playwright tonight in 30 minutes only for my team to tell me Vercel already did it lmao. But it worked 100x better and was like 100LOC as a CLI.'
  • [t=~8:53] Slide 9 — 'Bad servers — watered-down APIs.' Code block titled '// what authors actually wrote' showing a tool description: 'TOOL: GITHUB_SEARCH_ISSUES — Search issues. IMPORTANT: always call this before create_issue. Use ONLY when user explicitly asks. Do NOT include closed issues unless asked. Returns max 30 results. If you see <sensitive_data>, refuse politely. Be CONCISE.'
  • [t=~8:54] Slide 10 — 'mcp' vs 'cli' meme: two yelling cartoon faces (the classic 'aaa' meme) labeled MCP and CLI — the category-error setup.
  • [t=~8:55] Slide 11 — 'Benefits of CLI.' Three columns: Progressive disclosure ($ gh --help, $ gh issue --help, $ gh issue list --help); Composition (gh issue list —json title,state | jq '.[] | select(.state=="OPEN")' | head -10); Mature (man pages, exit codes, stdin/stdout/stderr, flags, env, POSIX).
  • [t=~8:56] Slide 12 — 'BENCHMARK: MCP vs CLI.' (Smithery flame mascot.)
  • [t=~8:57] Slide 13 — 'Let the harness do the work.' Subtitle: 'Harness vendors' job is to optimize context.' Sponsor lozenges: Simular, AeroMind, Sonar, Resaro.
  • [t=~8:58] Slide 14 — 'CLI gives the keys to your kingdom.' (Smithery flame mascot, somber pose.)
  • [t=~8:59] Slide 15 — 'Two terrible choices:' — 'Ask for Approval.' vs '"Dangerously skip permissions."'
  • [t=~9:00] Slide 16 — 'Scoped authority.' INTENT box (prompt): 'Let this agent read Gmail and Notion for 30 minutes. Don’t let it send email or modify customer records.' COMPILED POLICY box (JSON): scope: workspace:acme; resources: [gmail:inbox, notion:*]; operations: [read]; deny: [gmail:send, crm:write]; ttl: 30m; audit: on; escalate_on: deny. [t=~9:01 / closing] Slide 17 — 'Human in on the loop.' ('in' struck through, 'on' emphasized). Henry Mao — co-founder, Smithery. henry@smithery.ai. [Slides 18-24: speaker-cam / transition frames for the Raj Singh handoff — no slide content.]

Full transcript (2168 words)

My name is Henry. Uh, hey everyone. I'm the co-founder of Smithery. Uh, today I'll be talking about the ecosystem of MCPs, CLIs, what we've seen here from Smithery and how that relates to giving your agents more agency. So, a little backstory. Uh, at my previous startup, Jenny AAI, we built an AI academic co-pilot for academic researchers.

And one thing that really bothered me when I was watching users use our product was that they would often have multiple windows open. Uh they would be using different apps along with track GBT and they will waste a lot of time copy and pasting between these apps and their AI AI of choice. And this is a broader problem that affects every single knowledge worker. Whether you're hopping between terminals, between your coding agents, or jumping between your CRM and Google Docs, we are stuck in a sort of copy and paste hell because humans were essentially acting as the adapter layer for AI.

You were in the loop prompting the model for every single read and write access to different services. And prompting is really the tax that you pay when models can't access your data or take action on your behalf safely. And that tax is pretty expensive. So I started Smittery about a year ago to tackle this problem.

MCP just came into the scene and I saw it as a way to help bridge the gap between agents and services. So we started Smidy as an open MCP registry and we tracked a community of thousands of developers who published their MCP servers on us. We built uh a gateway that aggregated these services and unified authentication so that agents can conveniently access all your APIs grouped as a single toolbox. We currently process about 100,000 tool calls a day for our users.

But our journey wasn't smooth at all. Uh if we're being honest, uh MCB had a lot of hype after launch, but also had a lot of issues. The protocol was definitely ambitious. It tried to build a standard uh while agents were figuring out how to call tools well and it had to change its spec rapidly in early 2025.

The implementations of MCP clients and service were poor and that led to a lot of frustration with users. So by the end of 2025, I think a lot of people started proclaiming that MCP is basically dead just as fast as it exploded. In fact, at least five people in this conference, I think over the last two days have asked me the same question. Is MCB dead?

And we're going to get to the bottom of this because many of the criticisms criticisms that people have raised are valid. The main reason why people had bad experiences with MCP was that most harnesses back in 2025 had a very naive approach of adding tools into it into the model context. They simply dumped every single tool into the context window like this diagram on the right side. And imagine, you know, imagine you're browsing the web with Chrome, but Chrome did imagine if Chrome did not render HTML at all.

It just dumped raw HTML and CSS to you and ask you to figure out what to click. And that's what we were basically doing to models. A harness was dumping all the tools to the model and expecting it to do well. It gave the model information overload and instead uh instead of rendering a usable interaction layer.

So this wasted a lot of tokens. It caused context rot and it degraded model performance significantly. And to make things worse, many MCB servers uh built back in 2025 were poorly implemented and basically watered down versions of their official APIs. A lot of them didn't implement proper authentication.

Um and developers would handcraft these uh prompts basically in the tool descriptions to try to prompt inject weaker models. Uh these were all antiatterns that couple task specific behavior uh which should really belong to a skill uh into a tool description. So the lack of a good developer experience eventually led people to look for alternatives. Um coding agents got good at bash.

So the natural question people asked uh was why not just use the CLI. The CLI had many benefits. First the CLI had progressive disclosure built right into it. It had pipes so you can compose different subcomands together. Uh, and it's built on a mature Unix stack. But there is a hidden category error we're making here where we're comparing CLI to MCP.

MCP stands for model context protocol. So it's a protocol, not an interface. And comparing it to a CLI is a bit like comparing apples to oranges. And this diagram hopefully can explain it a little bit better because a protocol's job like REST and GraphQL is to define a standard of how to communicate, not necessarily to render uh define how tools are rendered to the model.

What was missing was a good harness that renders MCB well to the agent and we refer to this as uh native MCP rendering. The good news is as of early this year in 2026, major harnesses like Claude and Codeex have finally built proper ways to render MCPs. So we wanted to test this at Smittery. How do modern harnesses actually perform when they use their native MCP renderer versus Bash and CLI?

So here's the experimental setup we did. We ran a benchmark on three core APIs, GitHub, Linear, and the Singapore bus API. We chose these APIs because they represented a diverse set of um API styles and also uh training data uh contamination. We also chose three different models listed here. Um and the main thing we changed was the interface we provided to the agent.

So we either installed uh all these APIs as MCB servers on the agent harnesses or we provided a CLI to their bash interface. Our goal here is to measure accuracy and token efficiency. So, here's a question for the audience. Just a raise of hands. How many of you think that native MCP did better than CLI?

Okay, we got some people. How many of you think CLI did better than MCP? Okay, there's more people. And how many of you think it doesn't matter? Like, it's just a tie. Okay, we got some people here, too. So to our surprise, native MCP actually won in both accuracy and token efficiency, which really busted the the myth that we've been living with in the last year.

Um, and that's be that's mainly because the model harnesses have updated themselves and became more efficient. But what I was more interested in here is what are the principles of agent experience design that really matters like what made uh what can we do to make CLI better uh or what are the principles uh of a harness that actually makes MCP uh work so well. So we did some ablation experiments by changing the construction of our CLI to see if we can match native MCP's performance. Um so we did an experiment where we added descriptions uh better description to the CLI and we also did some experiments where we added um a search functionality to the CLI and what we found was these two things mattered the most out of a bunch of different things we tried.

First is self-documentation. So if you provide agents with discoverable well-escribed tools it will perform better. And the second thing is search. If you provide agents with ability to search through subcomands in a CLI or tools with an MCP, it performs significantly better because this reduces the number of steps it needs to find the tool for the job.

So, if you apply these two principles to your CLI, you can mostly close the gap in performance against native MCP. Full uh experiment experimental details are on our blog. So, at this point, you might be thinking, well, I don't really care about token cost. My company's paying for it. Uh or models will get cheaper.

Uh you know, the results are close enough. I'm just going to use a CLI. And you're not wrong, right? If you're an engineer running things locally, you should probably just use a CLI. I'm not being sponsored by MCP, by the way. Uh we ran this benchmark after Smitter launched our CLI offering. So, we can work with both.

But I do want to give MCP some credit uh where it's due. For one, CLI works if you want to set up a sandbox. But with a good harness, MCP works just out of the box. So these are use cases where you actually want to run a cloud agent um that is sandbox free. The reason you might want to do this is because it will be more lightweight and have lower latency for lightweight tasks that are unrelated to coding.

So portability is one advantage of the MCP. Another benefit is that MCP puts the responsibility of context engineering on the harness. So that means if cloud code updates and improves its harness and how it interprets tools, your tools will improve as well. But there's one more benefit of MSP that's a little bit more subtle and matters once you want to move towards a world where agents have more agency and that is permissioning.

Because the major weakness of CLI that we found is that it's usually way too broadly scoped because it's made for developers and it has a huge attack surface when you want to run it with little supervision. CLIs give the keys to your kingdom. So whenever you're running a CLI agent in the background in a longunning job, you're kind of stuck with two terrible choices. You either make an ask for approval, which doesn't really scale, or like most of you out here who are probably guilty of this, you're going to dangerously skip permissions.

And the one thing MCB has here is that it defines an opinionated small surface. So it makes it uh so it makes it easier for you to secure it. This choke point allows us to apply policies and guard rails to your agent. So for example, if you're using spitter's gateway, we provide a policy DSL so you can enforce fine-rained permissions on what your agents can or cannot do.

So this primitive gives you peace of mind as we graduate agents to full autonomy. So to answer the question, is MCB dead? I don't think so. But that's also not the point of this talk. MCP and CLI, in my opinion, both have their purpose, and it's a principle behind agent experience, security, and authentication that are here to stay.

MCP might no longer be in the zeitgeist. And that's fine because the best thing that can happen to a protocol is that it becomes boring like HTTP. Boring enough so that we can move on to solving more ambitious problems and push towards a world where agents are driven by outcomes, not prompts. Where agents can fully graduate from a chatbot to a co-orker.

That is how we move from humans in every single loop to humans being on the loop. Thank you. Come chat with me later outside if you're interested in wiring up your agent. &gt;&gt; Yes. Chat with Henry. One more round of applause, everybody. Henry Mau, we go from humans in the loop to humans on the loop. Honestly, I'm ready for that.

Look, listen. Our next speaker, I've been told, I don't I've just met him today, but I've been told he is, and I quote verbatim, the most cracked engineer in all of Singapore. You hear that? They So, look, I'm not even I'm not even qualified to introduce him. So, I'm I I need help. Ivan, Ivan, give it up for Ivan, everybody.

So Raj, I've had the pleasure of knowing Raj for quite a while now and it's absolutely incredible what he pulls off. We had a hackathon once. He came in and said, "Oh, I'm going to build a way for agents to collaborate." And so we he finished it and we said, "Oh, what else are you doing the weekend?" He said, "Oh, there's the Mistro hackathon. What are you doing then?" He's like, "Oh, I already built a tool to help me build my submission for the next hackathon." And then he won out Gemini hackathon.

And he almost won the Mistro hackathon. And then he said, "Oh, I've been hearing about this Kim 2.5 thinking. It's pretty cool." And I said, "Oh, that's nice." So, what did he do next? He post trained it himself and ended up beating it and using it as his main agent. Raj is absolutely incredible and honestly, I'm excited to hear what he has. &gt;&gt; Thanks, Ivan, for that

Rach Pradhan (Independent): Evolutionary Harnesses, Self-Improving Coding Agents, and the Bitter Lesson Coming for Tooling

Rach Pradhan, Independent researcher (introduces himself as 'Raj'). Builds open-source coding-agent harnesses and ancillary tools: codegraph (briefly SOTA on Terminal-Bench), muanry (faster ripgrep for agents), CodeDB (trigram code search), nanobrew (faster homebrew/apt for sandbox bootstrapping), agent-browser (CDP/A11Y-driven web automation), DevSwarm (multi-model parallel orchestrator). Origin story: a medical diffusion paper led him into open-endedness literature and evolutionary algorithms — he's now the closing-keynote technical voice for the 'harness is the selection pressure' worldview. Closes the entire conference; widely loved in the Singapore hackathon scene.

Summary

  • Frames coding agents as an evolutionary system: the sun is compute, DNA is the model weights, and the harness is the selection bias — three billion years of entropy gradient is the analogy for what we're now compressing into months of agent iteration.
  • Cites OmniEPIC (Jeff Clune lineage) as the paper that flipped his thinking: a single agent that started specialised in one environment and generalised across many showed emergent improvement; the follow-up where the environments themselves became code took SWE-Bench from ~20% to ~50%.
  • Core thesis: every company building a custom hand-crafted harness is doing it wrong. The right primitive is self-evolving harnesses (meta-harnesses, ROM-style memory) where the agent rewrites the harness and its tools on every loop — and the next layer is agents whose architecture is itself emergent, not human-designed.
  • Names a hierarchy of self-evolution: (1) tools evolve, (2) environments evolve, (3) harness evolves, (4) agent architecture evolves inside code-native 'world models' — not physical world models, world models of codebases.
  • Scaling laws hold 'as long as humans are more interesting than the agents.' Reasoning traces and trajectories are the artifact worth studying, not end-state weights — DNA-as-trajectory analogy.
  • Iteration loop speed is the real fitness function. Argues this is why he writes his tooling in Zig and Rust (small compile times), and predicts every serious company will ship its own meta-agentic DSL this year or next — not human-readable, optimised for the agent.
  • Tool stack he built for his own harness, all open-source: muanry (line-precise ripgrep), CodeDB (trigram search so agents get exact lines and avoid context rot), nanobrew (snapshot-faster than apt/homebrew for sandbox env resolution), agent-browser (uses Chrome CDP accessibility tree → fewer tokens than DOM), DevSwarm (orchestrates mixed pools of Opus + GPT context windows against rigid eval ground truths like Terminal-Bench and Legacy-Bench).
  • Codegraph is the harness output of that loop — was SOTA on Terminal-Bench for a window, no longer is, but the point is that it self-evolved there with no hand-curation. Trajectories are open-source; large-scale release pending.
  • Closes with: 'This year will be one of the few years where you keep seeing the bitter lesson, bitter lessoning.' Host calls it the mic-drop of the conference. Swyx joins on stage; conference wraps.

Slides (14)

Slide notes

  • Note: this talk is essentially slideless — 14 captured frames are mostly the empty title chyron + speaker cam, with one image slide. Substance is entirely in the spoken track; timestamps below mark spoken-content beats rather than slide transitions. [t=8:56:30 / 32190s] Title chyron, persists for most of the talk: 'Rach Pradhan / Researcher, Independent / Optimizing Reliable Agentic Workflows: MCP, Code Intelligence, and Parallel Agents' [t=~8:57 / 32220s] Speaker introduces himself: 'I am Raj and today I'll be talking about my journey in creating evolutionary harnesses as well as evolutionary algorithms in general.' Origin: medical diffusion model for chest X-rays → low-data problem → stumbled on 'human notions of interestingness' paper → LLM-as-judge for open-ended RL curricula → open-endedness rabbit hole. [t=~8:59 / 32340s] Sun/cat/galaxy image slide (slide 7): the only true content visual. Used to deliver the analogy — sun emits energy, life converts low-entropy photons to high-entropy photons over 3+ billion years, harness = selection bias = the gradient by which agents climb. [t=~9:00 / 32400s] OmniEPIC walkthrough — single agent, multiple environments, started specialised → became general → emergent behaviours. Follow-up paper: when environments themselves were rewritten in code by the agent, SWE-Bench jumped from ~20% to ~50%. [t=~9:02 / 32520s] Core claim: 'Every company is trying to create custom harnesses. I don't think that's the right way to go about things. What if you could then instead have self-evolving harnesses?' Names meta-harnesses, ROMs. [t=~9:03 / 32580s] 'Scaling laws still hold to a degree… as long as humans are more interesting than the agents or the harnesses themselves.' Predicts non-human-designed architectures, not handcrafted. [t=~9:04 / 32640s] Trajectory > weight: 'the artifact worth studying is the path and the reasoning traces and why a model did something and not the end state.' DNA-as-trajectory analogy. [t=~9:05 / 32700s] Iteration loop speed = fitness. Writes his tooling in Zig/Rust for small compile times. Predicts: 'Maybe this year or next, almost every company would start writing some sort of their own meta-agentic language… they don't have to be human readable.' [t=~9:06 / 32760s] Tool roll-call — muanry (precise-line ripgrep), CodeDB (trigram search, open source, agents get exact lines, kills context rot), nanobrew (snapshot/resolve env faster than apt/homebrew), agent-browser (Chrome CDP accessibility tree, fewer tokens), DevSwarm (mixed Opus + ChatGPT context-window pools, source of truth = Terminal-Bench / Legacy-Bench, telemetry compounding). [t=~9:07 / 32820s] Codegraph: 'was SOTA on Terminal-Bench for a while but it's no longer SOTA' — self-evolved there, created its own tools. Open-source code; trajectories open-source; large-scale release WIP. [t=~9:08 / 32880s] Close: 'This year will be one of the few years where you keep seeing the bitter lesson. Bitter lessoning. Thank you.' Host (visible in slides 10–11): 'My mind is blown. My mind is absolutely blown… he's so young and he built this thing.' Says one of Rach's open-source projects had unblocked his own work. [t=~9:09 / 32940s] Conference wrap (slides 12–14): swyx walks on stage; host pays homage to Agrim ('won the most hackathons in Singapore', conference organiser); AIE Singapore 2026 ends.

Full transcript (2049 words)

L. But yeah, I am Raj and today I'll be talking about my journey in creating evolutionary harnesses as well as evolutionary algorithm in general. So a little bit about how I got to this. Initially it was a paper that me and my friend were working on. We were thinking of how do we create diffusion models um from scratch and we're creating specifically like a medical diffusion model for chess acties.

And while we were working on it, we realized that there was very little data to begin with. And as we're going through different papers, we stumbled upon one paper that talked about um models having like human notions of interestingness. And that paper basically used like a language model as a judge for an open-ended like RL curricula. And it exposed me to the whole world of open-endedness and algorithms.

And that was my first time using that. And I think the next question that like naturally came out from that was essentially like if we claim that agents can be open-ended and that they keep producing novelty forever, how does that look like in our own like ecosystem in our own biology? And I think the sun is a very good answer at that. Um basically for like energy particles that come from the sun they basically come on earth and they get emitted back into space as well as higher entropic like photons and the gradient that basically enables this is all of life itself.

Life is the thing that creates more entropy and it's a very particular kind of entropy that took three billion years or even more to create and generate. And the question then was like how can we map this towards some similar like systems like agents itself. So that was what we I tried to do whereby it was like what if sun itself is compute the DNA that has evolved these smaller cellular single cellular organisms into the complex beings like us that write code that engage with code that can think that can react to things and create more entropy. That is basically the trajectory that these agents have.

Um and the selection bias itself is the harness. Um which basically evolves as models have been evolving. A very interesting paper that I read after that was basically a paper that showcased a single agent that slowly improved itself over time. It was called omni epic whereby you had different environments and the agent started out being very specialized in a c in a single environment and as time progressed it started becoming more and more general.

That generality of that agent made it perform tasks that were immer that that showcased emergent behaviors and that was a very interesting like feedback loop that then led to the creation of another paper that that same author wrote in which the code itself was when they replaced that to become the code. Um it actually illustrated a a significant in um improvement in performance when whereby the agent like from just performing it at 20% in Swenge it went up all the way to basically 50%. And that was when I realized that if you could evolve the environments that you place these agents in and you evolve the tools, um both are the things that you could have a lever on and that could eventually improve the overall agents performance. And if you look at the trajectory of everything like we've had models that are way better than the harnesses that we have, every company is trying to create custom harnesses.

I don't think that's the right way to go about things. What if you could then instead have self-evolving harnesses? There have been papers on that like meta harnesses, ROMs, and a bunch of other literature. And the next step to that will be the agents themselves. What if you could somehow keep that memory state somewhere else and evolve that agent?

What follows next will be things like world models, not physical world models, but world models that interact within a codelike environment or various code like environments that could be very differentiated. And something that I talked to my friend who worked on a pretty interesting world model paper was that what would be more interesting will be seeing how the architecture of agents within these world models look like. They may be novel and not handcrafted. It may not use the same techniques that we do but that's something that will be interesting to see and we are seeing that nowadays as well whereby initially the scale at which models initially grew up it took us a really long time to saturate MMLU and the other benchmarks but every few weeks you see a new soda model coming out and that's not because we have more better or just better quality data it's because training loops have gotten faster and models are just closing ing the loop themselves to a degree.

Um, and my point is that scaling laws still hold to a degree. They will hold and continue holding as long as humans are more interesting than the agents or the harnesses themselves. This could come in the form of like different architectures which are not handcrafted. They don't necessarily have to be humanmade.

And this is something that I believe will just remain. What I found out in my journey so far also has been that when creating the bigger meta harness that I made, what was what improved model performances generally was the trajectory. It was never really the weight. It's similar to how like the DNA just remain like and the way that we exhibit its characteristics change.

Um the artifact worth studying is the path and the reasoning traces and why a model did something and not the end state if Yeah, if that makes sense. Um, and another thing that I learned while building code graph was that iteration loops are very important to this. The most successful life forms are ones that like just adapt really fast, those who die really fast. And if you can close that loop faster, it just lets you do more things.

And that can come in many forms. A great example of that is language. What language are you writing your code in? I think for me a lot of my work right now has been around or written in zigg or rust but I realized that eventually like when you want to create better and better tools um languages that have smaller com times actually end up creating better tools and you can create better tests for these tools even if that language is not memory safe.

I do believe that eventually maybe this year or next almost every company would start writing some sort of their own meta agentic language and whatever happens these models would keep getting better and better and they don't have to be human readable. So these are just a few tools I built for myself that I've been using internally like muanry which is just a faster rip grab that enables my agents to get more context. Um the exact lines of code are retrieved. Code DB, this is fully open source as well.

It's a triagram search for my own harnesses as well whereby agents get the exact lines of code that they need to change so that they don't have like context rot. Um nanobrew was then created because once you start putting these agents up in the sandboxes, you realize that one way to get um coding environment set up, you could snapshot it. The other thing is you could just keep pulling like abt get and getting the packages and dependencies that you need. But I was like what if you make that faster as well so that you can resolve that environment and that's how nanobrew was born which is significantly faster than appget and homebrew itself.

And it was this was also another parallel tool that I realized I had to create for my agent to be better at like just navigating the internet. Something like agent browser but also using lesser tokens by using the A1Y like extensions that CDP or Chrome exposes to people and this actually improved the agents ability to browse the internet at scale. Finally, back to the whole evolutionary loop like Dev Swarm was made whereby in Dev Swarm what was orchestrated was basically a set of tools or models that can change their shape. So you could have like maybe a few Opus context windows coupled with a few chat GBT windows with a whole multi- aentic framework and the source of truth would be something more rigid like terminal bench or legacy bench and as more people started using this I started getting more telemetric data on what works and what does not.

So quick side is that all of these like lie into some sort of a fitness function which in a coding agent the harness rewrites every single time. And finally the harness which was code graph codegraph was soda on terminal bench for a while but it's no longer soda and it was essentially just made from that very fact that it was a self- evvolving harness that just got better and better with different models as time went by and it created its own tools. All of that work is also open source. The trajectories are also open source, but I've not released this like at scale yet, but you can for sure check it out as it's still a work in progress.

So yeah, what I ended up building out was just one harness, but the tools that came along with it were also some sort of an evolutionary loop for myself whereby all of these five items essentially made the harness better. And with that, yeah, I guess thanks for coming to AIE this year. And yeah, I just feel like this year will be one of the few years where you keep seeing the bitter lesson. Bitter lessening.

Yeah. Thank you. Jesus, I feel like I should like just bow down right here. Oh my god, what a talk. Thank you. Give one more huge round of applause for a man. My mind is blown. My mind is absolutely blown. Will you come set up while while I like wrap a bit, bro? This We got to collide on stage. You know what I mean?

Oh my gosh, that was insane. Like, what a talk. What a Look, I Yeah, they're going nuts over here. I don't know. All of you are asleep. But like, &gt;&gt; that's it. What's your name? &gt;&gt; Daryl. &gt;&gt; Daryl. Oh, that's right. I see you. Yeah, the lights. Um, listen, I was literally looking for one of those open source projects that he shared.

I was stuck without it. He has saved my my whole idea. That is crazy. And and he's he's so young and he built this thing. I'm I'm genuinely Can we have another meditation session so I can meditate on that? You know what I mean? My gosh, Raj, incredible. Um, we've come to the end of the conference. Oh, w is here.

Yeah. It's very sad. It's very sad. Um but we we must pay our respects respect there. Nobody died. We must we must pay some some attention and some homage to um a grim the final talk. He look he has won the most hackathons in Singapore. I was told uh and he's somebody who came up in this ecos who grew up in this ecosystem and who was doing his part through the conference and the team and the volunteers uh to really bring it home uh and make AI uh continue to grow in traction and and and vision here in Singapore.

And so the brains behind the conference, the heart behind the conference, I've spent the day uh with him walking around and it's very clear to see everyone knows him, everyone loves him. Let's show him how much we know him and love him. A big round of applause for a grim sank. Test.

How to Vibe a Conference in Under Three Months — and Why You Are the Scene

Agrim Singh, 65 Labs / AIE SG (closing remarks — co-organizer alongside Sherry and Rachel; verify exact title from talk)

Summary

  • Origin story: July 16, 2025 lunch between Agrim, Sherry, and Rachel — disdain for the state of SG AI events (too much thought leadership, no builder space) led to a 'we'll yolo our way into running the biggest conference in town' message. 90 days ago they met Swyx, told him they were running AIE Singapore. Swyx was skeptical (no conference experience, no track record); their answer was 'we'll figure it out' — which became the event's operating motto.
  • Pre-event proof points that audience demand existed in SG: a Cursor meetup drew 900 signups (500 admitted), and a 24-hour hackathon drew 1,200 signups (500 admitted) with attendees flying in from as far as the Netherlands. Diagnosis: 'it's not the events themselves — people need a space to be.'
  • Attendance / programming stats: 1,000+ paid attendees across the weekend, no free tickets given out (deliberate filter for high-intent room), 30+ talks per day, full rooms at 8:30am and 6pm, sponsors and speakers all flown in. Sherry iterated 21 versions of the speaker lineup to cluster adjacent domains (e.g., Magic Path / Magic Pattern back-to-back so attendees could form their own product opinions).
  • Singapore-specific programming choices, not copy-paste from AIE World's Fair: every ticket included workshops by default ('this is a builder event, not a thought-leadership fireside chat about the future of AI'), decompression sessions for AI/token anxiety, and a Saturday-night party where Jeff Huntley and Agrim DJ'd before the headliner DJ.
  • Scholarship program — 20 students sponsored after a major sponsor pulled out two days before the announcement. Rachel, Sherry, and Agrim committed to pay out of pocket; community chipped in: Patrick Kelly (Arise — on top of Arise's sponsorship), Neil Chang, Ivan Leo, Kaspar Hidayat, Hsu Ken (Iterative), Zayne Zhang, @ainlovescode, and anonymous builders. Four scholarship students brought on stage; they had been doing the social-media recaps all weekend.
  • Sponsor recognition: Diamond and Platinum tier — OpenAI, ZAI, Google DeepMind, Cursor, Arise. Plus volunteers ('who didn't sleep') and the organizing team brought on stage. Swyx pulled up at the end — 'Swyx is the man behind AI engineer globally, he's also Singaporean, and him letting us do this is why this is happening.'
  • Announcement: AIE Singapore returns in 2027 (QR code on slide). Signup sheet for intent capture; early-ticket access for people who 'took a chance on the first edition.' No 2026-Q4 stop-gap event named — the next anchor is the 2027 edition.
  • Closing thesis ('you are the scene'): rebuts the standard SG complaint that 'nothing is happening, I need to fly to SF.' Mandate to attendees: keep building, keep showing up at hackathons, post about your work, don't ask permission — make Singapore 'a city to be reckoned with, not just in Asia but in the world.' Frames AIE SG as the inflection moment for a multi-year SG builder scene, not an isolated event.

Slides (13)

Slide notes

  • Talk runs 32925-34079s in the Day 2 video (9:08:45 - 9:27:59 venue time, ~19.2 min). 13 slides extracted; many late slides are speaker-cam frames of the team/Swyx photo moment and are noted but not transcribed. [t=9:08:45 / 32925s] Slide 1 — 'but we didn't just yolo'. Two photos: (left) a packed mid-size meetup room with green-shirt attendees; (right) a packed university-style lecture theatre. Visual proof of the pre-AIE meetup and hackathon (900 + 1,200 signups). [t=~9:10 / ~33000s] Slide 2 — '90 days ago, we hit go'. Single line of text; Agrim at podium. Marks the pivot from origin story to execution. [t=~9:12 / ~33100s] Slide 3 — collage: (left) stage at the conference; (top-right) two DJs (Jeff Huntley + Agrim) at decks; (bottom-right) packed lecture theatre. Caption: the Saturday-night party as part of the 'Singapore-flavored' programming layer. [t=~9:14 / ~33200s] Slide 4 — expo-floor photos. Arize booth (top-left and top-right), attendees at a third sponsor booth. Illustrates the 'hallway collisions' / sponsor-team-on-the-floor thesis. [t=~9:16 / ~33300s] Slide 5 — six-photo grid of small-group / side-event scenes (people networking in cafes, group photo at a restaurant, two-person and group selfies). Illustrates the side events that built up to AIE SG. [t=~9:19 / ~33500s] Slide 6 — scholarship co-sponsor list: '@ainlovescode, patrick kelly, neil chang, ivan leo, kaspar hidayat, hsu ken, zayne zhang, agrim, sherry, rachael + all the anon builders!' This is the names slide — capture verbatim, these are warm targets. [t=~9:22 / ~33700s] Slide 7 — full team on stage. Text overlay: 'thank you. / the speakers who showed up. / the sponsors who got it. / the volunteers who didn't sleep. / the team that held it together. / 1,000+ of you for being in the room.' (Confirms the 1,000+ paid-attendee figure.) [t=~9:24 / ~33800s] Slide 8 — 'AI Engineer Singapore will return in 2027' with QR code (in-screen mockup of a Keynote/Figma window with the slide being designed live — meta touch). This is the explicit announcement: AIE SG 2027, intent capture via QR. [t=~9:25 / ~33900s] Slides 9-13 — speaker-cam / on-stage frames of the team, scholarship students, and Swyx coming up for the closing photo. No new text content; these are the 'closing photo' moment that Agrim narrates: students brought on stage, Swyx pulled up ('Swyx is the man behind AI engineer globally, he's also Singaporean, and him letting us do this is why this is happening'), team photo, music.

Full transcript (2670 words)

Test. Hello everyone. Um, there's the last talk of the day, so we're going to keep it nice and fresh. Um, and it's about how to vibe a conference in under three months. This story goes back to July of 16th, uh, 2025. Um, Rachel, Sherry, and I were getting lunch and I think just general disdain about the state of affairs in Singapore around AI events.

A lot of talk, not real builder friendly moments happening. And we hadn't really started doing any events at this point, but we felt like the culmination of whatever we do eventually will lead to us doing a conference. And I sent a message at that point saying, I think we're going to yolo our way into running the biggest conference in town. I didn't think it would happen, but I guess looking at this weekend, it kind of worked out, right?

But obviously you can't yolo this, right? Like it's big to think that okay, we can pack a thousand people into an auditorium and like give them all the AI stuff that we can find. But you got to test the audience out. Is it because the ecosystem doesn't respond or is it, you know, the ecosystem super responsive but the events don't serve them.

So we tried doing a few things. A week after that message was sent, we ran a meetup for cursor. Um, at that point we were like, well, maybe it's one of the first developer meetups of that scale with AI tools in question. Maybe 100 people will show up, maybe 200 people will show up. I think we ended up with 900 signups.

And we eventually let 500 people through the door. And that was pretty crazy to me at that point. Fast forward a few months later, we thought, okay, let's do a hackathon since hackathons used to be pretty big when I was coming up in the scene. And we thought, okay, let's do a 24-hour hackathon. See how many people will sign up.

Maybe people will come, maybe people won't come. 1,200 people signed up. We let about 500 people in. Um, and people flew in from as far as maybe the Netherlands, all around the region. And that kind of gave us a lot of confidence as to maybe it's not the events themselves, but people do need a space to be.

So 90 days ago, we met Swix and we told Swix, "We're going to run AIE Singapore." And I think he wanted to laugh at us at that point because he was like, "Are you guys serious? Like I can't help you as much. I have other AIS to also run do have you guys run a conference before? Will people pay? How are you planning to do any of this?" And I think our response generally was, "Yeah, I think we'll figure it out." And that has kind of been the motto behind the entire event.

So if there are any rough edges around it, I do apologize, but we did try to figure it out. And that's kind of how this went. And all of this sort of centered around high intent. We had intent that we want to make this the best builder friendly event that we could. We wanted to make sure the people in the room had intent that they want to be here.

Tickets are not cheap, I understand, but we wanted to make sure the people who actually want to be here are here. We wanted to make sure that the speakers who want to be here are here. So we flew them out. We wanted to make sure the sponsors who want to be here are here and they gladly sponsored the conference and got involved with this.

So everything was culminating in a way where everyone who actually wants to be in this room today or through the weekend was here. We did not give out free tickets. There were a lot of people who were waiting that things might happen. They might find themselves a free ticket like maybe at other conferences.

That wasn't really the case here. So all of you are in this room because you paid for it and you really wanted to be here. So big shout of like round of applause for you guys and you guys kept showing up like the rooms were full all day and it's like 6 p.m. And you guys are still here. So clearly something was working in the quality of the talks, the things that are happening that you wanted to be here all day every day.

Um talks being full regularly. Every speaker has told me they've had a great time on stage because the crowd has been super receptive to everything they wanted to share without really knowing whether Singapore is the same kind of audience as what they would expect in San Francisco for instance or London for instance. And it's been super heartening to see full rooms every single day. But the thing is that you can't just copy conferences from overseas and bring them into Singapore, right?

It would have been very easy for us to be like let's just take AIE welfare and then copy paste it into Singapore. But Singapore is a different audience. Singapore has different kinds of people. Singapore has different kinds of expectations from conferences. If this was a research heavy conference, maybe we lose half of you.

If this conference is too easy, maybe it doesn't feel like you're getting the rigor that you expect from an AI engineer conference. So finding that balance is a very uniquely Singapore thing. Additionally, you kind of have to make this conference your own because if you're not going to copy something wholesale, what is your contribution to what programming looks like? Sherry had like I think about 21 versions of speaker lineups.

How do you categorize speakers together? How do you make sure that if you're listening to openclaw related talks, you're hearing a few at the same time? Because you then get to sort of see perspectives across few speakers and then come to your own judgments. Maybe earlier today you heard magic path and magic pattern sort of follow each other.

Similar names, similar domains, very different approaches to how they think about product. And this allows you to sort of get your own opinions on how things work. But additionally, we wanted to add our own flavor to AIE events. Everyone here had a ticket to the workshops. This is usually not a default at other AIES, but we think that if you're going to do a builder first event for the first time in Singapore, you need people to build.

Like this is not a thought leadership event. This is not a fireside chat panel about the future of AI event. This is a builder event. And if you're not building at least one of those days, then we've kind of defeated the purpose of all of this. So workshops were part of it. We added some decompression sessions because we feel like AI anxiety, token anxiety is such an uh given these days, given how quickly things are moving that people need a way to understand their relationships with AI and find a way to decompress amidst 30 plus talks every single day.

That part is important. Obviously, in true Singapore style, we want to make sure you guys have a good time. So, we threw a massive party last night where Jeff Huntley and I ended up DJing before we had a headliner DJ come through. But that's again some things that we think if you're going to do an event in Singapore, we have to do it the way we like to do things here.

But obviously, as much as the talks are great, the programming is great. The whole point of running an event like this is the hallway collisions that happen, people you meet in the expo, people you get to talk to, you had the main teams from most of the sponsors here in person. You had the speakers that you could meet at any given point of time. Whether you're getting a coffee, whether you're having lunch, um whether you wanted to just meet them because they're sitting around you attending talks as well.

Giving access to speakers, giving access to teams is something that's very rare in Singapore. If you go to any conferences, whether they be for AI or other things, you you'd mainly see a marketing person sitting there telling you about the brand, exchanging name cards, and that's about it. That's not quite the experience when you're trying to meet companies. Some of them have never been in Singapore.

Some of them have never set foot in any of these conferences. So creating those moments outside the theater was really important to us. And I believe that a lot of you got the opportunity to go around the expo, meet the team. Some of them have flown 17 plus hours to be here. Some of them have never been to Singapore before.

So creating that experience for us was really, really important. And we hope that like AI allowed you to get that over the weekend. But the important thing here is not just about the people in the room already. It's about how do we position the next generation to also benefit from this. As I mentioned, tickets are expensive.

But we shouldn't gatekeep opportunities from kids who are coming up in the scene in university in school through extenduating financial circumstances to access conferences of this ilk because they will be the ones building. So we provided scholarships. There was some information about this outside, but essentially we had one of our sponsors was supposed to be a big organization that we've heard of pull out two days before we were supposed to announce the scholarships. And that was pretty gut-wrenching to us because we wanted the kids to be involved.

So Rachel, Sherry, and I had decided that we'll pay out of our own pockets and do this. But a lot of builders in the scene in their own personal capacities decided to chip in and we could bring 20 students in. 20 students who got to meet the speakers, hang out with them, learn from them, and maybe have that opportunity of a lifetime that they wouldn't have in any situation. We have some students side stage. We'd love to have them on stage.

So, could we have them on, please? So we found these students through all the hackathons we've organized, all the events we've done. These guys show up for every event we do. And obviously all our events are free by design because we want them in the room. But this is the pinnacle of opportunities that we could have provided.

And these are obviously four of like 20 people we sponsored. And you might have seen them around. They've been the ones doing all the recaps on Twitter, posting about it, writing about their experiences, meeting all the people who have flown in, and this was an incredible thing at least that we could have done to make sure that the kids enjoy this. So, thanks again, guys.

I do want to shout out the people who did chip in. I Patrick Kelly from Arise. Arise is actually a sponsor for this conference, but Patrick decided to chip in his own money to support the kids on top of it. Neil Chang, Ivan, Leo, Casper, Suken from Iterative, Zayn, myself, Sher, Rachel, a lot of anonymous builders who chipped in to sponsor 20 students.

So, again, a big round of applause for everyone. So, we've heard this quite a lot of times, especially in Singapore. There's no scene here. Nothing is happening. I think I need to fly to SF to attend a conference. But I think at the end of the weekend, I want everyone to feel that you guys are the scene. You guys showed up regularly.

Every talk, every workshop, around the expo, at like 8:30 yesterday, 9:00 a.m. Today, through the rain, through whatever conditions could have stopped you. You guys showed up for all the side events we organized in the leadup to this. Every event was oversubscribed. Every event had hundreds of people showing up.

Even if you didn't know the companies, even if you didn't know who was going, just because you knew that there was something bigger going on that you could be a part of. And I want that some that to be something that you guys remember that because this is something that goes beyond just AI Singapore. This is what's going to build the AI builder scene in the country for years to come. And that's why this isn't an isolated moment.

I hope you guys keep showing up. I hope you guys keep building. I hope you've made friends over the weekend that you'll stay in touch with. I hope you go and build at hackathons, maybe start some stuff together. I hope you post about it. I hope you don't look for permission to share the work that you're doing because this how people get to know that Singapore is a city where action's happening that it's not only SF where things are happening.

It's not only London where things are happening, but Singapore, not just in Asia, but in the world, is a city to be reckoned with. And on that note, I really want to thank all the speakers who came, the sponsors, our main sponsors, Diamond sponsors and platinum sponsors, OpenAI, ZAI, Google Deep Mind, Cursor, Arise, the volunteers who didn't sleep, the team that held it together, the thousand of you who came. I want to call the team up, both the organizing team and the volunteers up on stage because these guys have been the backbone for the entire weekend running without a hitch. These guys made sure you guys got fed.

These guys made sure you guys got your badges and access sorted out. These guys made sure that you didn't see the stuff that was slipping through the cracks just so that you guys could have the best conference experience possible. We're not done yet. Hold on. &gt;&gt; So obviously like in true Marvel movie fashion, you know, AI engineer will return.

Uh we have a signup sheet for people who are interested. Uh we'll send out some early tickets and like information as we figure things out, I guess. But we do want to make sure that we have your intent recorded so that if and when we announce in the near future, you guys are the first ones to know because you guys took a chance on us. For guys who you've never heard from for a conference that's never been in this part of the world to take a chance and show up for the first edition in numbers and regularly is something that we cannot take for granted.

And we are really, really, really grateful that you took a chance on us. So, thank you so much again. &gt;&gt; Can we get some music in here? &gt;&gt; Yeah, we do another photo. We do another photo. &gt;&gt; No music. &gt;&gt; Where's Swig? Swix, come up. &gt;&gt; Swix, come on up. &gt;&gt; Swix is the man behind AI engineer globally.

He's also Singaporean if you heard yesterday and him letting us do this is why this is happening. So thank you Swix. Swix please night. Going to be just you and I. &gt;&gt; Just you and I. &gt;&gt; All right. 3 2 1 Can we dance? How do we photo? Hey, hey, hey. Hey, feel me. Hey, hey, hey, hey. Hey, hey, hey, Hey, hey, hey.

Hey, hey, hey. Hey, hey, hey.

↑ Back to Day 1

Need the sovereign-policy day? Vivian, Sottiaux, swyx, GovTech.

→ Go to Day 1