
This subtitle was translated by AI. We cannot guarantee its accuracy and it is provided for entertainment purposes only.

Hello, everyone I’m Xiaojun In this episode, we have come to New York, USA It is Chinese New Year right now New York just had a heavy snowfall This is the coldest winter New York has had in years The streets are still covered with unmelted ice and snow

But today’s conversation gave me a feeling of the warmth of everyday life after the thaw Sitting across from me today is young scientist Xie Saining He has just embarked on an entrepreneurial journey together with Turing Award winner Yann LeCun Their new lab, AMI Labs has just completed its first mega-scale funding round

The team currently has 25 members Xie Saining has always told me he is not the “chosen one” he is the ordinary one And now, here is my interview with Xie Saining Ilya called me and I didn’t say anything I just turned down OpenAI They sent me an offer and I said I’m not going, sorry But wherever there is love, there must also be hate

They are two sides of the same coin [laughter] This morning we are in New York shooting B-roll in Brooklyn I really like it here Because I live near Times Square I think that area is still a very stereotypical New York But coming here feels like a New York full of artistic vibe and lively neighborhood energy

Yeah I think this area of Dumbo is of course very artistic Right, in many films There was a Korean film called Past Lives In that film, you may have seen the carousel And the Dumbo bridge over there, right Only tourists go to Times Square I am a tourist Real New Yorkers would never go But actually the area near NYU is also really good

That area is called Greenwich Village And that area is also a “village” And that area also has a great neighborhood vibe Why did you come to New York to do academia? That doesn’t seem like a choice many people make Well, not really But there is quite a long history That is true Various reasons I think

Of course Also because I genuinely yearned for this city Right I longed for many elements of this city The people here And including NYU That was also part of it And of course the main reason was still Yann (Yann LeCun, Turing Award winner and Executive Chairman of AMI Labs) And the AI efforts here Right NYU actually does quite well

But on the other hand NYU also has a very strong film school And many directors I admire Like Martin Scorsese Including more recently Chloé Zhao are all NYU graduates So that’s also partly the reason Right, also Also part of the reasons Right, I I I told you yesterday

I think — how many years has it been since I came to America I came in 2013 So it’s been about 13 years My ‘post-training’ is a bit broken now So I have this issue of mixing Chinese and English Sorry about that, viewers I’ll try my best to explain Please bear with me

Mm, I couldn’t find a podcast or interview of yours anywhere So Is this your first time doing a podcast or interview? First time doing a podcast First time doing an interview Right, you can probably find many Me going out to various conferences, right talks at conferences

giving talks and such many of those Why Why haven’t you been on a podcast all these years or done an interview I think Mm I don’t know I think I’m more suited to being a listener I really enjoy podcasts Right I often listen to a lot of podcasts My Spotify

YouTube, commuting every day, and before bed I often listen to podcasts in my spare time Mm, right And I think I have quite a desire to express myself Or rather I also talk about a lot of things with friends privately With students I think, mm Getting everyone together to chat, I think that’s very enjoyable Mm, but this podcast thing

I don’t know either Maybe it’s because nobody invited me That shouldn’t be the case Um, well, a little I guess But I still think Maybe it’s also because I’m more introverted I think a lot of times feel, mm I don’t know which things should be said which things are worth saying which things people would want to hear

But now I think, gradually as I get older it’s fine, it’s okay I have gained the courage to be disliked I actually looked up a lot about you online a lot of information But I found everyone’s description of you all starts from SJTU’s ACM Class And I’m also very curious What was Xie Saining like before that?

Could you start from your earliest memories of the world as the starting point and tell us about your childhood and growing up I Ah, OK See, this is exactly why I didn’t want to do a podcast [laughter] Because, honestly I’ve never prepared for this Or rather, you have to let me think back

from the earliest memories Well, it’s I think starting from when I was little Maybe When I was four or five years old Mm, my mom would take me traveling everywhere That might be my earliest memory Oh, where did you travel? All kinds of places Right, because she also did some business and traveled around everywhere

Traveling all around the country, right I remember very clearly, right my first impression of Shanghai And going to Sichuan, and then all kinds of tourist spots you can imagine Um But for me If I really have to dig into the family background My dad is a complete homebody Mm

never goes out But his favorite thing to do is read books So at home, there is a study room with several walls full of books So When I was young, I was basically in this state either running around outside being taken traveling by my mom or at home browsing through all kinds of books Books I should read, books I shouldn’t, I’d look at them all

Right And I think that was my early childhood And then later on I think our generation’s growing-up experience was quite different Because I think — well, I don’t know I think kids today might, in this AI era have the same feelings But back then for me When I was about 9 years old

I got my first computer And from that time on not for anything productive, right buying games box by box and playing them Then the internet came along and for the first time I felt this information explosion So That was the first time I understood what “content” meant And at that time I felt I suddenly had more desire to express myself

Because reading books is still a one-directional learning process though also very broadening But online, there were BBS forums back then And you could go online to share your opinions I still remember, right There was Sina Blog It probably doesn’t even exist anymore But I wrote a lot of blog posts Oh, really?

Ah, um about all kinds of random topics Now Looking back now, it’s definitely very funny But What was the most popular article? Quite a few, I think I remember It felt like forced melancholy — writing sad words without real cause Oh Maybe including QQ Space back then, right Everyone always wanted

a platform to express themselves And then later there were actually even more new media emerging including blogs then Weibo, right But back then it wasn’t Weibo actually It was Fanfou — I don’t know if you’ve heard of it Of course Wang Xing, right And at that time I was also a heavy Fanfou user On it

Fanfou can still be logged into now But it’s really hard to look at Sometimes I look at it I think, oh gosh Should I just delete it all But then I think Let it stay there Let it become part of the internet memory Mm But I think at that time I think I think this explosive growth of the internet

made me become someone interested in many things Mm I think that’s how it was So, your parents Your mom was in business Were you from a business family? Not really, not really Um Well, my dad basically He studied psychology in college He also did some education work before

And later also did some media work at TV stations Oh Maybe the same profession as you Oh Right So my memory of him when I was little is of him carrying a camera everywhere Oh, that’s interesting Right, right, right But in my family there really wasn’t anyone who studied pure science and engineering

This also gave your personality I think quite an artistic side Maybe, but But I think the one thing I want to say is Growing up in such a relaxed family environment really shaped my model of the world And I’m still quite proud of it Mm quite proud Because I think I would

Or rather, you just asked why I came to New York I think that’s part of it too Mm I think I would hope for myself or hope for the people around me to look at the world with a more open mind Were your grades always very good? Because you were admitted to SJTU’s ACM Class through recommendation Um, not at all It was from high school

Right, I think it was like this So, you can see Now I have many, many friends around me who are actually all those who’ve come up through the top track Right the best high school, right then competing in competitions the best undergraduate then the best PhD then after finishing, going to teach at, say, the top four universities

There’s a very clear main path, right And I have great respect for them I’m completely not like that I’m a, um At most, I have a B-class kind of trajectory Oh Like you And many of my decisions are actually quite mystical Because I think I haven’t deliberately, in some kind of meritocratic

this kind of framework to strive for things Many times it was actually quite random And maybe that’s just the way it is Maybe my intelligence just isn’t enough But indeed For example, being admitted via recommendation, right That was also very accidental Anyway, I had two awards in informatics and math competitions

And at that time SJTU happened to have this program where you could enter early basically trying to recruit some students and have them skip the college entrance exam Right Actually, I was originally following the gaokao path being prepared for it, actually I um, was supposed to be taking the gaokao So I struggled with this for a long time

The teachers at school all said, no, that won’t do How can you back out at the last minute Your grades are already very good, right You should of course aim for Tsinghua or Peking University But my inner thought was Well, SJTU seems great, I think I’ve been to Shanghai I felt this city and this school shared a kindred spirit with me

And I just wanted to study computer science And I think SJTU’s computer science was also very good at that time I had also heard of this ACM program Although the selection process back then actually required you to enter early and after entering there was a summer-camp-like program Right, and you would undergo some tests

before you could enter this class Right But many interesting things happened in that process Of course, first let me say I think I was quite How should I put it If I could choose again I wouldn’t regret it at all Right, I think that summer before entering early was a highlight of my life Why

Because during those two months, I did nothing just played games in the dorm Why is that a highlight? Because never again in my life did such a moment come What games were you playing back then? Um, many games Playing Dota and such Just in the dorm It was the kind of college life I saw online during high school

You know? Ah, it was There was the studying part But also some finding yourself in this kind of aimless time-wasting experience Right So Xie Saining’s life highlight was wasting time Really? In the dorm? [laughter] You could say that

Haha, that’s very interesting You keep saying you weren’t among those with the best grades But you’ve also had a pretty smooth path You seem to be among the highest achievers too Why is your self-perception like that? My grades are actually average It depends on who I’m comparing to Compared to the top competition winners like what I just described

those who had a very smooth path the top students from Yao Class and then comparing with the top four PhD programs, top four professors Then I really am far behind But on the other hand I think I’m still quite grateful for all of these experiences Because I feel continuing the story from here I think it’s actually quite interesting

For example, when I went to SJTU SJTU wasn’t necessarily a particularly leading school in computer science and artificial intelligence And now for example, the ACM Class has become Of course, this has nothing to do with me But my juniors including my seniors, right whether doing entrepreneurship or academia

shining and contributing everywhere And also We have a very strong alumni network everyone connected, working on things together I think I still think it’s an upward trajectory An upward trajectory And then later still There is another very interesting thing in here I want to mention

which is my ACM Class interview And in the interview process there would be senior professors Back then it was Professor Shen Enshao who interviewed us This interview didn’t actually ask you technical questions He would ask you, what books do you like to read Mm And I feel this was somehow destined there was some fate involved

Because I was very anxious back then and almost couldn’t answer Then I told him A book I actually really like and one I just finished recently This book is called What Is Mathematics? Then Professor Shen Enshao followed up and asked Who is the author of this book to test me

And I was a bit stunned And you know, right A high school student I can’t remember foreign names either I thought about it and ultimately managed to answer It was Richard Courant And then Professor Shen said Ah, right You must remember this name Because he is basically

one of the greatest mathematicians of the 20th century Why does this make me feel there’s a certain destiny or some coincidence at play is because now at NYU the institute I’m in is the Courant Institute of Mathematical Sciences which is Richard Courant’s institute the department he founded from the very first shovelful of earth

Mm So, I think it’s quite interesting Right And the application process later was actually similar I think Or to put this from another angle I think It seems like the world always doesn’t want me to do what I want to do Why But I insist on doing exactly what I want to do

Oh For example, during my undergraduate years I was initially interested in computer vision, right Or rather I developed some interest in artificial intelligence At that time also Starting out in the ACM Class Everyone would start doing this kind of research internship and would go to various labs within the school

to different laboratories And the lab I went to was one doing neuroscience + AI work called BCMI And the bookshelves had so many books about consciousness about the brain about images And then about how we perceive the real world books like these And after looking at them I thought, wow

That’s so interesting And um Later, in this process I also got to know a senior classmate of mine This senior was Hou Xiaodi Oh And he is also very well known He had started a company before and is doing entrepreneurship again now And every time I talk with him he always says The world has changed

But we haven’t changed By “we” I specifically mean him and me Because every time we chat it’s exactly the same as what we talked about over ten years ago Right, at that time he was a legend at the school Right, and he did two legendary things The first legendary thing was that as an undergraduate he published a paper at CVPR (one of the world’s top computer vision conferences)

Right, and in this paper was a very elegant algorithm only 7 lines of code in total that solved a very important problem Mm CVPR now accepts maybe several thousand papers each year Right, tens of thousands of submissions So now, when we’re looking to recruit undergrads

everyone has three, four, five papers each CVPR is already nothing special But at that time at schools in mainland China being able to publish work at such a top conference was actually extremely, extremely difficult very rare very rare And then For an undergraduate to publish such work was unheard of

So Everyone truly admired him very, very much Mm But then he did a second very impressive thing which was, um he led a team and wrote something called the “SJTU Student Survival Guide” Oh, this was written by a team? Um, he should be the main author

I don’t know A team worked on it with him This thing still has an archive online now I encourage everyone to check it out So what does this guide talk about I went back and revisited some of it just a couple of days ago I found it very, very interesting Right, um

What does it talk about It talks about why people should learn what exactly is wrong with China’s education system and the university model and where you should spend your time to achieve the life you want Mm And it also guides everyone on how to do research and what the purpose of research is the purpose of research is not to churn out papers

but is truly about exploring the infinite unknown things like this Of course It also teaches everyone how to skip class and how to complete assignments more quickly Right, it’s this kind of pamphlet I also went and read it It says if a person treats grades as their highest pursuit

then they are a sacrifice to that system Mm, I completely agree Right, I think looking back now these things probably had a subtle influence and really shaped my understanding of many things When he published this what year were you in? Um First or second year First or second year You already knew him in your first or second year?

By that time he had already been admitted and gone to Caltech for his PhD Because he also graduated from this same lab he and I essentially communicated online He had been admitted to a great school And we were all very, very envious

At that time And he and I would still chat on Google Chat back then about many, many things And he really also gave me a lot of advice I still remember What advice? Um, nothing specific More often when chatting with him online it was about research Right, what exactly should be done

sharing my own confusion with him and how to get a paper published roughly seeking his advice Right But at that time I think through Xiaodi and through the books I read I had basically decided I felt this is what I want to do with my life I think this thing is just so fascinating

computer vision Um At that time there wasn’t actually a name for it or rather, computer vision was slowly starting as a term But actually before that, right people had been processing image or visual information for a long time already For example, people would do so-called image processing

Um more often starting from an EE major Right, and computer vision was, um gradually becoming more and more popular Mm which was around when I started learning these things Right, and then Um, as I just said

The world always doesn’t want me to do this is because when I was in SJTU’s ACM Class there was actually another feature which is that every student in this class had to do an internship in their third year Mm That’s actually quite common now But at that time it was still mainly an innovation of this class’s founder, Professor Yu Yong

So at that time, most people in the ACM Class would work with Microsoft Research Asia which is MSRA through a cooperative program so many of our students were sent there to do an approximately 6-month internship Right, so Um, originally for me If I did nothing I would go to MSRA for an internship

Right, although that was also good But at that time there actually wasn’t a vision group willing to accept undergrads from the ACM Class for internships Why is that? Um, I don’t know Maybe because back then, professors like Ma Yi and Sun Jian were all there Kaiming should have been there too by then And I think

they probably didn’t like having too many undergrads who don’t know anything coming to participate in things, right At that time, they were extremely talented Yes, yes, yes, exactly But we really didn’t know anything Right I think I can gradually understand this now Um, but at that time, um, there was a choice which was still to go to MSRA

but not doing any vision-related research And Professor Yu also told me, well actually for you undergrads the most important thing now is still to get research experience and learn how to do research the specific direction isn’t very important Mm, right, um But I didn’t think that was okay

I felt I couldn’t accept doing a completely different direction I wanted to understand this field more I hoped to work diligently on some things and hopefully one day be like senior Xiaodi able to publish a CVPR paper Xiaodi was already your idol at that time, wasn’t he A bit

He was many people’s idol Right, during SJTU days Oh um, and then So I started thinking about how to handle this And started sending emails I contacted the National University of Singapore, NUS, right Professor Yan Shuicheng’s lab Mm, right This was entirely my own doing

I didn’t even tell Professor Yu And after it was confirmed that, hey, I could have this internship opportunity And on his side there were already some subsidies and we had talked about timing and arrangements the structure was already fairly well set up Then I went to find Professor Yu I said, Professor Yu I really don’t want to go to MSRA

I want to go to Singapore this school’s lab to do the research I want to do Mm Professor Yu was silent for a few seconds Right, um, maybe I guess I don’t know I haven’t asked him this question But I guess his inner thought was this student is so headstrong Right Because in the professor’s mind

MSRA was a better choice Yes, yes One, a better choice Two, I think it also keeps everyone together Right One reason is of course it’s easier to manage Second, there would be more synergy Right, everyone could still exchange ideas Then you going to a new place

what does that even mean is this place even reliable is what you want to do reliable this thing might be uncontrollable Were you conflicted about it? I wasn’t conflicted But I really appreciate Professor Yu in that he Anyway, he was silent for a few seconds and finally said okay You go ahead. Right, um, and so I went

But after this happened Professor Yan’s group, the NUS lab, became an option for my juniors an available position Mm So I think I still want to take some initiative taking some initiative and doing what I want to do Right

Image-related artificial intelligence was still very early at that time What exactly about it attracted you that led you to make many different choices? Because I think the way I experience the world is through vision Mm, I was probably a bit bored when I was little and I would think, hey

humans have so many right, senses If I had to remove one which would I remove I think maybe I could be deaf maybe I can’t speak maybe I have no touch, no smell I would live very miserably but maybe that could still be accepted But if I had no vision then I can’t watch cartoons anymore

I also can’t watch movies I also can’t play games I would seem to have lost a person’s independence And I think Of course these initial thoughts resonated quite well with what I later read in some books Um, because visual signals actually occupy a large part of the brain’s cortex

um, depending on how you say it, right the main visual areas might be about um, 30% of the entire brain But, um when the entire brain sees an image the activated parts might make up 70% Mm Right So Actually, all of us humans are visual creatures And this Right, that’s what I think

I’m also a visual creature I also very much like looking at things Animals too Not just humans Not just humans, right What you said is very, very correct Mm, actually it’s not entirely like that Because actually 530 million years ago creatures on Earth had no eyes

everyone lived in the deep sea where light couldn’t get in And then suddenly one day some creatures were able to develop vision Although still very weak only able to see a faint signal Right But at this point they were amazing

They could see the prey they wanted to hunt where it is, and swim over quickly and eat it They could also avoid predators someone’s coming to catch me I immediately run away Once vision was born Um other creatures in the evolutionary process had to evolve stronger vision Right, because

if you don’t have stronger vision you’ll be eaten Right So an arms race began So this is the so-called Cambrian Explosion That is to say, on Earth before the Cambrian period there may have been only a handful of species But after the Cambrian suddenly like a big bang hundreds of thousands of species emerged

One leading theory is that the origin of this explosion was actually an arms race among creatures at the level of vision Yes, yes So what you said is completely right I think This is actually not something unique to humans I think all animals are actually the same Mm

And so I’m still quite interested in this And you know this thing called vision isn’t just a sense There is a saying that the eye is actually part of the brain and it’s the only part of the brain exposed to the real world because other parts of the brain are all hidden behind our skull

Mm, right So thinking about it this way solving vision isn’t about solving vision itself but about solving intelligence itself Right, so I think everything can be connected From before you even officially started your first year hiding in the dorm playing games wasting time to you finding computer vision as the main thread of your life

what happened in between? Mm, actually nothing much happened Actually many times I think it all comes from chance Mm Just like if I hadn’t read that book back then I probably wouldn’t have taken this path But sometimes I feel this is also inevitable I still quite believe everyone actually has their own destiny

Or rather Sometimes I tell students Don’t think that if you don’t do this someone else will do it Instead think: if you don’t do this this thing will never happen in this world What does that mean? meaning you are now working on a research topic Right and the thing you’re doing

how you got here step by step to this endpoint completely depends on yourself your personal life experiences your background growing up maybe a book you read maybe a conversation you had with someone maybe it’s genetic, your genes simply being different from others’

Right, I think every individual in this world is very unique everyone is a variable in this world and who can say for certain It’s possible you are the most important variable in this world This is your worldview I think it’s my optimistic side [laughter]

Right Mm During your time at NUS Did you get what you wanted? Um, I think I think yes First of all, I made a lot of very good friends I can gradually elaborate on that later But I got to know, for example Actually the main person who mentored me was Feng Jiashi

He was a PhD student at the time Right, and he mentored me And we did some work We published a paper Not at a top conference though Unfortunately, I still couldn’t publish at CVPR during undergrad Mm But we published a decent one, a BMVC paper Right, it was a not-so-top-tier computer vision

paper So, um I think I still think there was a lot to gain For the first time I learned um, research what it’s about Right Having actually written a paper versus not having written one I think there’s still a big difference Was that your first paper on CV? Yes, yes

But you could say this was a CV paper though actually it wasn’t really about CV Its only application was face recognition it was more like a machine learning paper But that was normal at the time everyone studying or researching CV was doing similar things the so-called

manifold clustering related things Right, but it was at that time point That was 2012, 2013 2012, right So it was right at the AlexNet moment Mm So I was also at that time point learning about this Right, and then right and learning about ImageNet learning about deep learning So I think that was actually a starting point

That was when I just started doing research and learning how to do research and also a starting point for all of deep learning This was your third year Third year, right University was almost over at that point So you actually during your undergraduate years had already found your main thread I think so Mm What was your intrinsic reward mechanism at that time?

I think it’s still curiosity Right, it’s that I I think I want to know why Right Or rather This might also be my own explanation I also don’t know what exactly my intrinsic motivation is But Mm I want to understand more I want to understand

more about this field I want to engage with the top students in this field researchers professors and have deeper exchanges Mm-hmm So this is also why later I decided I still wanted to go abroad wanted to apply I think also Probably this reason too Here I want to ask a small extra question

You must also have many friends from Tsinghua’s Yao Class Right, I also have many friends from Tsinghua’s Yao Class who have come on my show Yes, I want to know compared to SJTU’s ACM Class what do you think is the biggest difference in Tsinghua’s Yao Class in terms of training I think the ACM Class is probably less competitive

One difference is, um, again this thing is actually still Professor Yu’s design He, I think, is, um quite a great educator I can say that Mm, right Like back in our days actually in our curriculum design um, there would be many seemingly quite strange settings For example, we had a course

that Professor Yu was actually very proud of called the ‘Student Forum’ What is this Student Forum? It means everyone comes to this class and spends maybe 45 minutes to 1 hour to do a presentation give a talk And this talk cannot be related to studying It can be about anything in the world but cannot be related to studying

Right, so, um some people would talk about philosophy some about history some about society some about many very interesting things Of course science was also allowed Mm, right And I think I think this might be a difference in cultivation approach Of course I’ve never been to Yao Class so I’m not sure

But I think everyone was still in a relatively relaxed and more liberal arts-focused kind of setting moving forward Mm, the impression you give me is you don’t seem like someone who likes excessive competition Um, I think I’m not afraid of competition but I genuinely don’t like excessive competition And I think excessive competition definitely doesn’t help innovation

Right, I think I think this Of course that’s not saying the ACM Class has no competition there is actually very strong competition Were you a winner in this competition? I wasn’t eliminated OK Right But actually it can’t really be called elimination it was more that everyone felt whether or not they were suited

and would choose to stay or leave What was your approximate ranking in undergrad? There were maybe 30-40 people total Maybe ranked around the teens Just not pushing myself too hard Not pushing myself too hard Mm Did you ever think about becoming for example, first or second in the ACM Class? Was that your goal?

I couldn’t have Right [laughter] Really, really couldn’t Because we had very strong Right, um students with competition backgrounds And the evaluation criteria I think were actually quite multidimensional it’s hard to say who was first or second Or if you only look at GPA then I really couldn’t

Mm, right And I think And for this maybe also inspired by the Survival Guide I also didn’t care that much So from that time you started following your interests very closely Yes, right I think pursuing my interests and I would do everything possible to make it happen Right, especially in the application process it was the same

Mm A previous example was you going to NUS instead of going to Microsoft Research Asia Right, when applying Actually there’s another story here which is that I almost didn’t get into any school well, not quite I did have some offers but none from a professor I wanted to work with doing computer vision

Oh This made me very, very depressed And at one point I would think Okay, I could go do some recommendation system research some more, um, you know machine learning research Oh Um, until finally I started frantically writing emails to everyone those cold-contact emails

Mm, right And then Professor Tu Zhuowen Right, Professor Tu replied to me But by then it was already very, very late Because you know For PhD applications the deadline is generally April 15th Right, I actually received this reply in April Oh Right Who was the professor you most wanted to work with?

At that time Um At that time there weren’t many professors doing computer vision Right, and then I think Professor Tu was certainly a professor I admired very, very much So I think he was also my top choice Right, mm And of course there would be many You would of course say

Like at Stanford Berkeley, right MIT would have many pioneers of computer vision But at that time those were beyond my reach Mm, right So I sent this email to Professor Tu And he replied to me And I remember very clearly Because of the time difference So Professor Tu asked if we should have a call

When are you free I said I’m free at any time And so at 3 AM downstairs in the dormitory I had this phone call with Professor Tu Telling him why I thought I wanted to do this Mm, what things I had done before And why I thought I very much admire your research I think we can work together

Right, so Later, Professor Tu rescued me Very, very, very lucky In the last few days In the last few days he rescued me But there was another twist later Because at first Professor Tu Zhuowen was actually at UCLA Right So the offer I received was UCLA’s offer And I got my visa sorted and was ready to enroll

And then about a week before Professor Tu said I’m sorry I’m going to change jobs I’m at UCLA for various reasons I don’t want to stay anymore I don’t want to continue here I’m going somewhere else Where am I going? Right now I can’t tell you either I don’t know either

Because he was also in interviews at that time Oh, really? And he told me You have a few options One is you can stay at UCLA and I’ll hand you over to other professors Or you can wait and see how my situation works out And possibly if I go to a school you’re willing to come to you can come with me

So did you wait? Or did you immediately say, I choose you? I basically said I immediately said, I choose you You didn’t care about the school? Um I think I don’t care about the school And I still think I think all these things are very interesting Because back then if you looked at UCSD in terms of overall rankings

nothing compared to UCLA Mm Now it’s completely different If you look at CS rankings or AI hiring and students including faculty resources in terms of AI strength I think UCSD is already among the top few Back then it was completely different And I actually always wanted to collaborate with a professor

named Serge Belongie who had just decided to leave UCSD too Well, so I felt everything was hopeless meaning the place I was going didn’t seem highly ranked um, and then faculty were also leaving But I thought about it and said none of this matters none of it is important

what matters is who I’m working with and on what and whether this is something I want to do I think putting aside all this noise this is the only thing I want to care about Mm, that’s very interesting Mm So this kind of thing happened several times As I just said at SJTU it was also an upward trajectory And then going to

UCSD That was also part of it Of course I’m not saying this has anything to do with me I don’t think it has anything to do with me But somehow I feel I can see a place’s or even a person’s upside that is, their potential Mm And I’m willing to grow together with those places

I think This is something I feel quite deeply How long did it take you to find out Professor Tu was going to UCSD? Um, maybe a few months later Right, maybe one or two months later Were you worried at the time? Of course I was worried Right Because Professor Tu is actually very humble extremely capable but very humble

So he would always give me a heads-up saying the school I’m going to might be ranked lower you should think about it Right, what did you say? I don’t remember very well what I said But again, for me this might not be that important And at that time it wasn’t yet time to make a decision Right, why should I

worry in advance about things that haven’t happened So I didn’t think too much about it Did anyone else make this choice? Among the students Professor Tu communicated with Um, basically none I was the first student he recruited at UCSD I think just based on that Professor Tu must like you very much Um, I think all of this is

I think it was also him saving me Um, indeed But it was not only the rescue at the beginning later on, doing research during the PhD I think he truly helped me Right, like my internship in Singapore and such you could say we were doing some research but in reality it was still small-scale stuff

having someone next to you teaching you the feeling is still different Professor Tu is the type who sits beside your monitor and goes through the code with you line by line that kind of teacher Mm, and he often I think proudly would tell us these things And I think he is very deserving of this pride, meaning he published several papers

that actually had an important influence on later computer vision all completed as sole-author works And these works didn’t have, like now everyone using PyTorch with so many open-source communities so many libraries you can use right, and GPUs in his time there was none of that he had to write everything from the ground up

For example, for a task like image segmentation he had to write from scratch about 50,000 lines of code He even sent me this code to look at That included the lowest level including some distributed training a whole series of things all written in C++ Right, 50,000 lines of code I think On one hand I feel I’m very lucky

not needing to go through all that But on the other hand I think actually their generation in America these scientists these professors are truly admirable Right, if not for them there would be no us today They actually, um blazed a trail Right, this path didn’t originally exist As I said, right

publishing a CVPR paper was actually a very, very difficult thing And there was a certain circle a certain fixed circle Right, and I think it required Professor Tu and actually his boss Professor Zhu Songchun and including later people like Fei-Fei (Li Fei-Fei, Stanford professor, co-founder and CEO of World Labs) and so on Professor Fei-Fei

everyone blazing this trail so that we have a path to walk Mm, I saw a Xiaohongshu comment saying Xie Saining was unremarkable in China nothing special made a big splash when he got to America So what exactly is the variable? First, I don’t think I was unremarkable in China Mm, I don’t accept that And I didn’t make a big splash in America either

I don’t accept that either I feel like the things I’ve done have been a fairly smooth a very gradual process Right, or rather I think this is also the state I hope to be in um, as a researcher, as a science practitioner meaning this is not a momentary burst of hormones or adrenaline

this thing might be a lifetime of building a very quiet process I hope to be in such a state When I say such a state it’s because I know many people are in this state the researchers I most admire they are in this state they didn’t say there was this sudden rise to fame

or at least their way of doing things is not or their purpose is not to become suddenly famous Right, I think so Then what is it for? It’s for thinking problems through Mm How did your PhD work unfold? The PhD work was also very interesting PhD work Um, I think it was also through Professor Tu’s hands-on mentoring

Right, but um We had our first paper By the way, during my PhD I wasn’t a successful PhD student by today’s standards either I published maybe five or six top conference papers What level is that? I don’t know That should have been fine for that era the level to get a job at a top lab

Now it might already be Right, now now many of my students publish many more papers than I did and the quality of work is also much better But anyway At the beginning I think we did a work called Deeply Supervised Nets Mm This work was actually Me and another more senior PhD student

completed it together in collaboration And at this time This was around 2013, 2014 And at this time, deep learning finally began to explode But I think this was also a very interesting moment Because actually many people didn’t accept this Especially many professors working in computer vision didn’t even accept this Everyone thought

deep learning was still a kind of alchemy still a black box people trusted traditional machine learning theory more trusting SVMs, or trusting some Bayesian theories Right being able to pivot in time to do deep learning research This, looking back now with the benefit of hindsight is a no-brainer there was hardly a choice to make

right, you should just do it But at the time, making such a choice I think required some courage So this is another reason I admire Professor Tu very, very much and I was deeply affected by this one thing That is to say he actually pivoted very promptly So this Deeply Supervised Nets

was in this era our first deep learning work Right, so this thing was actually simple it was about how all of these neural networks Um previously were just a single stream a long chain with your input and getting your output And with Deeply Supervised Nets the network is no longer that simple single chain meaning

you can now actually have multiple branches that is, your neural network can actually have multiple exits and at different exits you can apply a supervision signal In this way the most direct benefit is you no longer back-propagate only from the signal at the far end to the early layers

you don’t need to do back propagation from the far end all the way to the beginning you can actually do back propagation from an intermediate node This can partially solve the vanishing gradient problem Mm And this actually relates to what came later for example, ResNet has some resemblance to it
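The multiple-exit idea described here can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's actual architecture: the layer sizes, random weights, and the 0.3 auxiliary weighting are all assumptions of mine.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_entropy(logits, label):
    # numerically stable softmax cross-entropy for a single example
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[label])

# Toy 3-block network; all weights are random placeholders.
x = rng.normal(size=4)
W1, W2, W3 = (rng.normal(size=(4, 4)) for _ in range(3))
head_mid = rng.normal(size=(4, 3))   # auxiliary classifier at an intermediate exit
head_out = rng.normal(size=(4, 3))   # classifier at the final (far-end) exit

h1 = np.tanh(W1 @ x)
h2 = np.tanh(W2 @ h1)                # the intermediate exit branches off here
h3 = np.tanh(W3 @ h2)

label = 1
aux_loss = cross_entropy(head_mid.T @ h2, label)    # companion supervision signal
main_loss = cross_entropy(head_out.T @ h3, label)   # far-end supervision signal

# Training minimizes the sum, so W1 and W2 receive gradient both from the
# far end and, via a much shorter path, from the intermediate exit.
total_loss = main_loss + 0.3 * aux_loss
```

Because the auxiliary loss sits closer to the early layers, its gradient path is shorter, which is how deep supervision mitigates vanishing gradients.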

in that era everyone actually wanted to solve this problem So Deeply Supervised Nets was one way to solve it Actually though this was long ago right, this was 12 years ago I think research is like this 12 years later some of our current papers

are again using the same kind of design sometimes we don’t even realize it I think this is very interesting But let’s not talk about 12 years later Right, so my second paper was called Holistically-Nested Edge Detection (HED) a work on edge detection HED Right, I think about this paper I’m actually quite proud of it

Because frankly it solved a research problem um, it was both lucky and unlucky The lucky part is this paper was a good paper The unlucky part is once the problem was solved nobody worked on it afterward so nobody cited your paper [chuckles] so it lost many citations [chuckles]

Um, but um this work is essentially Deeply Supervised Nets DSN applied to edge detection implemented as a global what we call pixel labeling a pixel-level prediction task Mm And this also opened up many new ways of thinking for me

because I would discover that each layer of a neural network actually has implicit structure and information in it your neural network, again has not only input and output in between there is a lot of information it represents a so-called hierarchical structure of the world

For edge detection it means your early layers output so-called coarser edges Right, and the further up you go the more refined your edges become So finally you can take all of these edge maps and fuse them together to get one that best approximates human perception

such an edge output result I think this was also giving me a new understanding of deep learning It’s a very, very interesting thing You can think of it as a black box but each part of this black box can be opened up connected to new inspiration and used to reach new goals
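The coarse-to-fine fusion just described can be sketched with numpy. This is a hedged toy example: the side outputs are random stand-ins for real edge maps at different depths (different spatial resolutions), and the fusion weights, which HED learns, are fixed illustrative values here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical side outputs from three depths of a conv net; with
# downsampling, maps at different depths have different resolutions.
side_a = rng.random((8, 8))   # full resolution
side_b = rng.random((4, 4))   # 2x downsampled
side_c = rng.random((2, 2))   # 4x downsampled

def upsample_nn(m, factor):
    # nearest-neighbour upsampling via a Kronecker product
    return np.kron(m, np.ones((factor, factor)))

# Illustrative fusion weights (learned in the real model).
w = np.array([0.5, 0.3, 0.2])
fused = (w[0] * side_a
         + w[1] * upsample_nn(side_b, 2)
         + w[2] * upsample_nn(side_c, 4))
```

The fused map combines all scales into a single full-resolution edge prediction, which is the final output the speaker describes.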

I think this was very enlightening for me And this paper at the time also had a big impact on my life because it was published at ICCV and also received an award This award was the Marr Prize the Best Paper Award nomination not the Best Paper Award itself just a nomination But actually for the Marr Prize it selects two papers

which is to say the Marr Prize and the Honorable Mention are two separate awards So this made me feel if you want to talk about sudden fame I really did feel at the time look, I also became famous at a young age Now, of course we have many Chinese students also on the world stage winning so many Best Papers Right, but back then for me

walking onto that stage or that podium and giving the award presentation giving this talk I think it moved me greatly I felt, wow my life has begun Right, and I will keep working hard I will have more and more best papers Ah, unfortunately this was my last time receiving Best Paper [laughter]

What year of your PhD was this? Second year of PhD [laughter] And up until now Just a few days ago during Spring Festival people were still texting saying Happy New Year May you have many Best Papers I said it’s been 10 years everyone has been wishing this for me and I still haven’t received another one

Do you still want one? Um Good question Well I think this thing isn’t that important to me anymore On one hand I know the process I know actually um, whether I get a Best Paper or not might not represent the quality of the work And I also know the Best Paper I got Honorable Mention

was mostly luck too Mm-hmm It’s a hugely random process whether a paper gets accepted or not what kind of award it can get I think this thing is very, very random And if something is this random it shouldn’t be something a researcher should focus on So in your second year you felt life had finally begun

Right, and life finally began and then reality immediately knocked me over Right, um [chuckles] but it wasn’t that exaggerated That is to say, um during my PhD, again I’m grateful to Professor Tu in that he was actually a very, very open-minded

person who let us explore all kinds of different directions So during my PhD I did 5 internships in total I think even today although schools and industry already collaborate so broadly that’s still hard to imagine Why did you want to do internships? I just wanted to go out and see Mm

maybe it’s the same as traveling when I was young I wanted to know in different places in this world different organizations what kind of things were happening what people were doing what things I wanted to know all of this And on one hand I tell you right, I always wanted to do artificial intelligence or wanted to do computer vision

But on the other hand I would also ask myself What if I’m wrong? Right What if what if right, what if the world has something even more interesting happening what would I do Right, so I think This is another motivation of mine You went to NEC Labs America

went to Adobe went to Meta went to Google Research and DeepMind Right, thank you for the background check Right, yes Those are the 5 places And um actually the first four were all in the Bay Area So I was actually quite happy during that time every year I had my own beat-up car

and every summer I would sublet my dorm room drive my car all the way from Southern California to Northern California Mm an 8-hour drive Once or twice with friends but most of the time I was on the road alone I think this was actually quite cool Right, all my worldly possessions in my car two suitcases

not taking anything else because I’d given up my place too when I came back I’d have to find housing again Right, um, no fixed abode this nomadic researcher lifestyle I was still quite happy Which of these 5 places did you like most? I think each has its own characteristics Like among these 5 So I recently also told students

I have many students and their internships actually didn’t produce much good work And I told them I would use myself as an example I said, I did 5 internships and half of them I didn’t produce anything Mm And how long were these internship periods? Generally 3 to 6 months So about half of each year

half the time at school half the time in the Bay Area of course at the low point I was in London And I think it’s not about liking or not liking I would try to diversify Um, that is I would hope each place I went was different I hoped for a more diverse experience So NEC Labs America was of course the first place I went

And I think there I also published a CVPR paper And there, um, there were many great colleagues mostly Chinese people Mm and after work at lunch everyone would go together to Cupertino to eat That’s my impression of it I very, very much liked this group really liked everyone’s attitude toward research

And I also published my own paper So I think I was very happy about this experience Right NEC Labs America back then should have also been a gathering place for deep learning Dr. Yu Kai (founder and CEO of Horizon Robotics) also worked there Yeah Mm Yes Of course, it had two divisions one in Princeton and one in Cupertino (in Silicon Valley, California)

All the vision and media people were in the Bay Area And all those doing traditional machine learning work were all concentrated in Princeton Right And some of what follows we can skip But anyway, at Adobe I just didn’t produce anything The reason is, um Adobe is a very, very artistic

company with an artistic temperament Oh Makes sense And at that time I was in San Francisco And then having me do things related to design and crowdsourcing meaning you’d write some Mechanical Turk right, some internet user feedback systems and use them to guide some

machine learning and, um, this kind of computer vision tasks like segmentation this thing I just didn’t do well I still feel guilty toward my mentor Of course they were all very kind Right, but this was also a time that made me realize it’s OK not producing anything is actually not the end of the world

right, it’s not the end of the world But that period was actually quite depressing And this depressive period actually continued until my Meta internship at school I also didn’t seem to produce any interesting work And then after going to Meta um the internship was maybe only three months In the first two months I basically

was exploring some things related to neural network architecture but didn’t discover anything worth mentioning And then suddenly a turning point happened This, um He Kaiming (main inventor of ResNet) joined FAIR At that time Right So this was about halfway through my internship

Professor He Kaiming then joined FAIR and became a full-time researcher Mm, so That was my first time working with Kaiming That was my first time learning from him Right, and then we built some deep friendships I think Because at that time he was coming to America for the first time

It was his first time He had many firsts at FAIR right At that time he also couldn’t drive first time in America, unfamiliar with everything I had to drive him out to eat and drive him home sometimes [chuckles] But he later learned to drive himself And he also didn’t know how to use Linux

Mm, that’s also very interesting Right, because at Microsoft they could only program on Windows Right So I had to teach Kaiming how to use the cluster how to use Linux Right, but you’ll find Kaiming is Kaiming not without reason Right, and I think someone like him truly has this kind of

you could call it an aura or I could call it some kind of reality distortion field this is actually Steve Jobs’s term meaning the people around Steve Jobs, influenced by him would all feel reality had been distorted right, some things that were completely impossible could now gradually actually be done I think Kaiming also has this kind of magic

Right, and then So this was my first time seeing how a truly top-level researcher does their research At that point your internship only had one month left How were you able to build such deep friendship? I think, one is daily life interactions Why did he choose you? Why did he communicate with you? Because I was an intern there

and my manager entrusted me to Kaiming because I wasn’t doing well anyway hadn’t produced anything Then Kaiming came and my manager said, hey Kaiming, you come guide him come join the discussions Right, so there was still a month left And Kaiming said why don’t we participate together in the ImageNet Challenge

Right, just compete in this competition Mm And then I said, hey Sure, let’s compete in this competition Because when Kaiming was at Microsoft his work came about through competing in ImageNet right, building up step by step Simply put Mm And so we also went to play with this ImageNet

challenge Mm And in this process we discovered hey, some ideas we had thought of before were actually reasonable actually very good ideas Right And I actually proposed this idea to Kaiming Kaiming’s magic is he can take all very ordinary things and turn them into gold-like

valuable ideas So we did this ResNeXt work And then this was also our solution for the ImageNet challenge a submitted solution And we got second place Didn’t get first place But I think we were actually the most effective Should have been first Because the first-place solution was an ensemble solution

which combined some previous algorithms doing model ensembling a combined solution Right And we were actually a completely new framework Mm Right, and at that time Um I think what ResNeXt wanted to convey is how we by modifying the neural network architecture

learn a more scalable right, a more extensible representation such a representation this thing is also very interesting because this idea is very, very simple It says originally for example, my ResNet is just a serial network right, just layer by layer by layer like this conv layers

now I can in parallel expand into several different groups each group with its own small network so you have networks within a large network distributed in parallel with many small networks Mm why is this interesting because in today’s terms this is MoE (Mixture of Experts) Oh
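A minimal numpy sketch of the grouped-branch idea just described (toy sizes and random weights are my own assumptions): parallel per-group transforms are equivalent to one block-diagonal, i.e. sparse, weight matrix, which is why more groups let the network be wider at the same compute.

```python
import numpy as np

rng = np.random.default_rng(0)

d, groups = 8, 4
gd = d // groups                      # channels per group
x = rng.normal(size=d)

# One small weight matrix per branch (toy stand-ins for conv layers).
Ws = [rng.normal(size=(gd, gd)) for _ in range(groups)]

# Parallel branches: each transforms only its own slice of the channels.
branches = [Ws[g] @ x[g * gd:(g + 1) * gd] for g in range(groups)]
grouped_out = np.concatenate(branches)

# The same computation as one big block-diagonal ("sparse") matrix.
W_block = np.zeros((d, d))
for g in range(groups):
    W_block[g * gd:(g + 1) * gd, g * gd:(g + 1) * gd] = Ws[g]
assert np.allclose(grouped_out, W_block @ x)

# Parameter count: groups * gd**2 vs a dense d**2 layer, so at equal
# parameter budget the grouped network can afford a larger width d.
params_grouped, params_dense = groups * gd * gd, d * d
```

Here 4 groups use 16 parameters where a dense layer of the same width would use 64, which is the sparsity-for-width trade the speaker connects to MoE.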

So at least on ImageNet at the time we already saw a kind of scaling behavior that is, the more groups you have the sparser your neural network becomes and the sparser your neural network the wider it can get so at the same flops computation level you get better results it converges faster

and your final results also improve I think this resonates with what people are doing with MoE today aligns very well Does this work count as an extension of Kaiming’s ResNet? Yes, yes So why is it called ResNeXt Kaiming said, right this is Xie’s ResNet so the X means both “next” the next-generation ResNet and also

Um giving me some giving me some credit Mm I think Kaiming is someone very good at naming things Right at naming papers many later papers were actually named by him for us Mm Would he hide people’s names in them? Not really Not really

not every time but it was just a clever touch I think this is also part of his research taste Then why was your name hidden in it? I don’t know I think maybe also Ah I actually don’t know I never asked him Mm How long had you been working together at that point? Did your internship get extended?

All of this happened in that one month Right, it all happened in one month This kind of thing has happened countless times Many of my best works actually follow the same rhythm starting out unable to produce anything Oh and then at the end suddenly a burst of inspiration and then converging research is never a linear development

or rather, linearly developing research is never good research Mm And then Much of our work is actually non-linear I can tell you more stories later Mm Um, right Anyway At this time it was with Kaiming And then that period ended But your friendship continued, right?

I think so Right And then went to Meta This was a productive internship I think it was a productive internship And at Google? At Google I think it also went pretty well Because I started to learn how video works Right, these internships were all different from what I’d done before

Each internship was a different topic from what I’d done which led to my final dissertation actually, on the surface looking scattered but I was still able to find a way to connect them and I’ll tell you the way to connect them shortly Good But, anyway, at Google I went to study some video this kind of

neural network architecture and training process and what it should look like I think it was also quite rewarding Hey, I have a question Because you worked so well with Kaiming at Meta And then and he’s a very famous AI researcher why didn’t you stay and continue collaborating with him I think many people might make that choice

why did you keep going to other places to explore Um, this is actually Kaiming’s suggestion Kaiming would advise everyone to intern at different places this is the only way to maximize your gains Right So like us back then me and Wang Xiaolong we had all done one internship

And then um, we of course all wanted to stay but Kaiming said go check out other places maybe there will be different gains Mm But after your PhD you returned to Meta Yes, right I think I think also after finishing the Google internship I immediately went to intern at DeepMind I think that experience

was actually very enlightening for me Mm, at that time DeepMind wasn’t yet Google Had it not been acquired yet? No, no, it had already been acquired but they were two different organizations because it, um, was based only in London Right So during that time I went doing some RL-related research

Ah And the reason was I really didn’t know how this thing worked and I wanted to go and see And doing it was very painful And that period was London’s winter so cold London winters are very cold I still remember very clearly I’d get off the London underground working until very late

at night maybe 10 or 11 o’clock and the biting cold wind mixed with rain hitting my face coat and hat couldn’t block it walking step by step back to my tiny room Right, the temporary dorm It was actually quite hard Right But that period for me I think was also very enlightening First it made me feel I didn’t really enjoy doing

RL (reinforcement learning) related research Or rather I didn’t enjoy robotics-related research Robotics Because at that time RL was actually in this kind of virtual environment simulated environment doing some embodied agent tasks Mm But I think my bigger gain actually came from

my understanding of an organization like DeepMind being built up at that time Mm I thought, wow this place is so different different from everywhere I’d been Right They had a very different management model For example, they would have many PMs coordinating different research teams and the operations between them

They would have these different working groups where everyone still had many bottom-up ideas But it wasn’t purely bottom-up either it was a staged, hierarchical management mode Starting with purely exploratory ideas where everyone could have their own small group to do some early studies

and then once something takes shape it would immediately transition into a more top-down more organized management mode I think this is very, very interesting And thinking back now Right, I also mentioned this on Twitter before That Demis also met with many interns And everyone organized a meeting And Demis said to everyone

or rather, someone actually asked him this question Saying, hey what exactly is DeepMind’s mission what do you ultimately want to become as a company Demis’s answer was DeepMind will ultimately become a company that can win multiple Nobel Prizes

the key point being multiple I think we all said back then, wow that’s so ambitious isn’t that a bit far-fetched they’re just doing AI But now we see they have already achieved at least one step I think I think it’s truly very, very admirable Actually the entire AlphaFold team

was in the process of forming during my internship gradually coming together Right I could actually see which people were doing these things And at the beginning some interns were also participating in this process and step by step how it went from an exploratory idea to gradually becoming organized focused on execution

step by step ultimately able to completely change the world the whole arc of such a project The organization question we’ll discuss in detail later Mm, I’m thinking did you do too many internships so you didn’t get any more best papers after Mm I think that might be the case

or rather, I think what I did was maybe too much, too scattered Which year of your PhD did you start internships? From the first year Oh, from the first year So these two were always intertwined Mm, right So I think you’re very right actually my timeline was disrupted Right, it does lose some focus

But I think this was also a design of my own So coming back how to connect all these things I think my doctoral dissertation title is Um this Deep Representation Learning with Induced Structural Priors roughly about some structural priors Um using these priors to guide us how to learn a better

deep learning representation Mm And this again, many many years have passed but I find what I’m doing now is still this And then at a conference in November or December there was a workshop whose title was Representation Learning with Induced Structural Priors roughly about structural priors and representation

a topic roughly like this And I gave a talk there And at the end of my talk I said, actually over the past 12 years your workshop topic though still a frontier is now discussed with somewhat different meaning But this was also the problem I wanted to study at the beginning and also what I feel now

is still not fully solved Right, so on one hand I think during my PhD the timeline was a bit fragmented The reason is I was doing different things in different places But on the other hand This is also, if you want to tackle representation learning as a topic this is also unavoidable because it’s like planting a tree

your representation is actually the root of this tree after this tree grows it needs to have different branches Right each branch is actually a different what we call downstream application a new application So I’ve done image recognition image segmentation edge detection video recognition

action recognition, right, and even later some embodied RL-related tasks. When doing all these things, the problems I saw are all branches on that tree — they are not the roots. Right. I think it’s possible what you said is right. I haven’t considered whether I would have had more best papers

[chuckles] but I hope to plant this tree well and put down deeper roots, rather than going further out on the branches. Right, mm. And I think, again, this is the core of deep learning — that is, representation learning. Representation learning is basically equivalent to deep learning. Let me explain to everyone what representation learning is.

Um, good question. Right, this thing — um, I think the reason I like saying I am someone who does representation learning is that it is still hard to define. Mathematically speaking, you can think of representation learning as: you have data, right, x, and you now want to map it to a space

and now this space might have some good properties — properties that may make it easier for you to achieve better results on downstream tasks. Right. So what you want to learn is, um, the mapping function from the initial data to this space with good properties —

this is what is called representation learning. And this function is not just a simple mapping — it might be a hierarchical mapping. And of course this can be implemented in different ways; the mainstream implementation now is to use a non-linear neural network to implement this function.
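The mapping view just described can be sketched in a few lines of code. This is purely a hypothetical illustration with hand-picked weights (nothing here is learned): a representation function f maps raw data x to a point z in a new space, and the "hierarchical" part is the composition of layers, each implemented as an affine map followed by a ReLU non-linearity.

```python
# Toy sketch of representation learning as a mapping f: x -> z.
# The weights below are made up purely for illustration; in practice
# they would be learned from data.

def relu(v):
    # elementwise non-linearity
    return [max(0.0, u) for u in v]

def affine(W, b, x):
    # one layer: W @ x + b
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def make_encoder(layers):
    # hierarchical mapping: compose layers into a single function f(x) -> z
    def f(x):
        for W, b in layers:
            x = relu(affine(W, b, x))
        return x
    return f

# hypothetical hand-picked weights, just to make the mapping concrete
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # layer 1: R^2 -> R^2
    ([[1.0, 0.0]], [0.0]),                    # layer 2: R^2 -> R^1
]
f = make_encoder(layers)

z = f([3.0, 1.0])  # raw data x mapped to its representation z
```

Here the "good properties" of the target space are whatever makes downstream tasks easier — for example, that a simple linear classifier trained on z does better than one trained on the raw x.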

Right, so I think this is a definition. But, as I just said, the reason I’d be willing to say I myself am someone who does representation learning is that I think this is a timeless title. Because this field develops too fast, many times we do many things — let me give an example, this might be a very, very

very negative example. In the past, actually — at what time? maybe just after finishing my PhD — something was very, very hot called NAS, Neural Architecture Search.

Mm Um, in this field there is a lot of consensus that this kind of topic wasted about two years of the entire field This was a wrong direction Everyone went down this wrong path publishing thousands of papers but ultimately got nothing out of it Mm And so Why do I say representation learning is a good

title like that or I am willing to tell everyone I am someone who does representation learning is because this is a fundamental problem If you say now I am someone doing Neural Architecture Search then this becomes very problematic It’s possible after 2 years you’d have to immediately change fields You’d have to update your website

“My research direction is Neural Architecture Search” — delete that sentence and replace it with the next fancier or different term. It is not a timeless theme. It is not a timeless theme. Mm. Representation is a timeless theme, the most fundamental theme, and a theme that has not yet been solved.

Mm So, ah, hey I may have talked about my PhD a bit too long [chuckles] But But I still want to say That is to say, I think during my PhD I also experienced more setbacks For example Our initial Deeply Supervised Nets paper this also At first we submitted to NeurIPS and got a pretty high score

review scores of something like 8-8-6 — or 8-8-7 — but it was ultimately still rejected. And this was also a blow to me. Mm, I found, wow, publishing a paper is actually this hard. Even with very good reviews, it was still rejected for some ridiculous reasons. What was so ridiculous? The ridiculous reason was that

we had a mathematical formula in the paper, which should have been squared, and we had a typo — we left out the squared term. Didn’t write it. It was purely a typo, very easy to fix. But the PC said — the Program Chair, the person responsible for these conferences — said this makes your math invalid,

it’s an error. And during the rebuttal, when responding to the reviewers, the reviewers didn’t see it, so unfortunately there was no way to fix it. So at that point all we could do was Now it seems unimaginable. First of all, nowadays perhaps people don’t check the formulas in papers anymore.

Second, I think people have become relatively more tolerant. Back then, people were extremely nitpicky about details. Yeah, right. But it’s fine. We ended up submitting to AISTATS — another conference — a machine learning conference. And that paper won their Test of Time Award last year.

The Test of Time Award. So I think After all this time. Right. Because all Test of Time Awards evaluate things 10 years later — at the 10-year mark, among all papers published 10 years ago, which paper had the greatest influence on the field. Right. So I think I suddenly felt at peace again.

I think Research truly is a long-term process. And so, That’s also why I tell many of my students this: And I think don’t worry about your wins and losses at every moment. Or, to describe it mathematically, don’t worry about a point estimate. Don’t, on this timeline, at every point,

evaluate whether you’re doing well or not. Because all evaluations are ultimately an integral. You need the accumulation of time. In the end, look — everything you’ve ever done, added together, determines whether you’re a good researcher. But in that moment, you’ll still feel very down. Very down. Right.

Extremely down. In that moment it’s hard to think about 10 years later. Hard to think about what happens 10 years from now. Mm. When you finished your PhD, what expectations did you have for your life? You had published some good papers, you had 5 internship experiences, did you think you should go into research or into industry?

Did you make that choice? I was never very confident back then. At that time I never even considered a faculty position. Because I thought I didn’t deserve it. [laughter] Because Why did you feel unworthy at every moment? It’s a bit better now. But, uh, Maybe that’s a bit of an exaggeration. It’s not that I really felt unworthy.

But compared to my peers, they were on the established track, like I said, moving step by step toward a good faculty position. That path. I felt I wasn’t on that path. Oh. Or rather, What you just said makes a lot of sense. If your final destination was really a faculty position, at least at that point in time,

you shouldn’t have gone to 5 places for 5 internships, working on 5 different projects. That’s very unfavorable for finding a faculty position. If you wanted a faculty position, staying in Kaiming He’s team would have let you publish more papers, gotten more results, during that period, it might have been a smoother path

toward a definite goal. I don’t know if it was a definite goal. I really think it’s quite mysterious. All these decisions came down to: I only thought about where I should go to do what I most wanted to do, ideally with the people I most wanted to work with. Working together. I think This idea is actually very, very simple.

So when job hunting back then, actually I I was looking everywhere. There were quite a few offers from major companies. Right. and I’ve talked before about my OpenAI interview experience. It was actually pretty cool. Basically, I was in a small dark room for five or six hours, working on one problem.

When I came out, it was already dark. Right. I found the experience quite fascinating. It felt quite extraordinary. But back then actually Who was the interviewer at OpenAI? John Schulman (OpenAI co-founder, Thinking Machines co-founder and Chief Scientist) Oh, right. I saw you wrote about this experience on Zhihu. Right? Uh, not on Zhihu,

it was on Twitter, on X. Right, Zhihu reposted it. That’s it. Yes. So his original interview questions were on a single A4 sheet of paper, handwritten in pencil, line by line, handwritten interview questions. I think it really moved me deeply. I found it so fascinating.

This place is very interesting. And, uh, In the end, Actually, There was an offer, of course, but in the end I didn’t go to OpenAI. I didn’t go to OpenAI. This is where the timeline — quantum mechanics — starts to diverge. That was 2018. So early. Mm.

So if I had gone to OpenAI, maybe, uh, you’d now be part of the LLM world. Maybe. I don’t think so. I don’t know. I don’t know. Don’t know what would have happened. Back then I didn’t even think about it. I just wanted to go to FAIR. If FAIR gave me the offer, I would definitely go. Your reason for wanting to go to FAIR was Kaiming?

Uh, right. Kaiming, Piotr Dollar, Ross Girshick. The so-called the three pillars of computer vision back then. They weren’t that senior — university professors or anything like that — they were all young to mid-career, researchers. But the absolute top three. Right, they were there.

And the research they were doing was the absolute top-tier computer vision research. So for me, there was no choice to make. So it was kind of fun back then. Here’s the thing — Ilya (Ilya Sutskever, SSI founder and CEO, OpenAI co-founder and former Chief Scientist) called me, and I said almost nothing, and I rejected OpenAI.

They sent me an offer, and I said I’m not going, sorry. What did Ilya say on the call? Uh, he was very angry. He asked me, “Why didn’t you even discuss it before rejecting the offer?” “Is the money not enough?” How much was it? Uh, I don’t remember exactly. It was actually very, very low.

Maybe, uh, probably in the hundreds of thousands. Back then, around 2018, the pay for a top PhD graduate would be roughly $400K to $500K. Dollars. Right. And now it’s at least tripled. But anyway, at that time OpenAI was at that level too, which was fine.

Right. And then — but Ilya was very angry. So I could only give vague responses and told him that I couldn’t go. And at that time, what did he say when angry? Uh, not much actually. His tone was just very stern. Why did he decide to make this call? I don’t know.

That shows he really cared about recruiting. He had never been rejected before. Uh, no. I don’t think that’s the case. In 2018, I think he was probably often rejected. Because FAIR at that time — not just in Vision — in many areas, for the top PhD graduates, FAIR was more certain than OpenAI,

more open, more like an academic environment. Such an institution. I think, at least at that time, everyone around me, if given that choice, unless they really wanted to do what OpenAI was already doing, the things OpenAI excelled at, I think most people would still lean toward FAIR. Did you get the FAIR offer smoothly?

Uh, not that smoothly. I think it was also quite rocky all the way. When you rejected OpenAI, was it because you already had the FAIR offer? Yes, right. But at FAIR, I gave a talk, this talk — I had no experience at all, it seemed everyone at my stage was quite experienced at job hunting,

while I knew nothing. So I gave a talk, and, uh, the talk was scheduled for one hour. Normally you’d speak for 45 to 50 minutes with 10 minutes for questions. But I finished in 30 minutes. Done. Everyone looked at each other, not knowing what to do. of course, many of the researchers there

gave me a lot of face and asked many questions, so the time was somehow stretched to 45 minutes. It wasn’t too awkward. Later Kaiming told me that everyone thought this was first, very unconventional. How could you finish so fast? Second, Maybe interviews should all be like this — a 30-minute talk works fine,

saves everyone’s time. So many times I’ve done things without doing them perfectly. Hmm, why did you finish so quickly? Why didn’t you follow the rules? I didn’t know there was a rule. Oh. Didn’t read it. Uh, I didn’t know about this rule. Like now, for example,

Because this rule is actually a job talk rule. Nobody told me this rule. Right, people just said, “There’s a talk starting at 11,” but this is actually an established convention because that’s how academic interviews work. and FAIR back then was actually an academic institution. Mm. It was really like a university.

Its operating model was like a PI leading a group of young people — whether interns or newly joined members — working together. And when I joined FAIR, I was probably among the first few — I’m not sure — Chen Xinlei was probably the first, but I was probably the second — a fresh PhD graduate who could join FAIR.

At first they didn’t recruit new PhD graduates. If you were just a PhD graduate, they didn’t want you. They would only recruit people like Kaiming, who had already done very impressive work, those kinds of researchers. Mm. Right. So I was also quite lucky. Right. Mm. I think FAIR really was the holy temple at that time.

Mm. And so, I didn’t agonize much over too many other possibilities. Mm. And then About the Ilya situation, let me add one more thing. I’ve only talked to Ilya on the phone twice. This was the first time. We can talk about the second time later. It was in July 2024,

right after he founded SSI. He emailed me and asked if I’d be willing to come work together. And you rejected him again. Uh, right. Why this time? This time because I had just started at NYU. and Mm. I think there were several reasons. When I talked with him, Uh, the main topic we discussed

wasn’t salary or anything like that. We didn’t talk about any of that. The main topic was how to give future artificial intelligence the ability to love. the ability to love. Discussing philosophy. Of course, I finally asked him one question. I asked how he viewed multimodality, how he viewed computer vision,

or general perception models — what did he think? Ilya’s response was he felt this was already solved well enough. Okay, so I thought maybe, uh, SSI has its own language-based approach. And that approach, at least for now, is not the path I want to pursue. This is your fundamental disagreement —

LLM versus vision. Right. We can talk more about this later. But I don’t actually see this as a disagreement. I see it as an organism. Everyone is just in different places, doing different things at different times. I always like to say, “Brothers climbing a mountain, each making their own effort.” Everyone doing their own thing.

No problem with that at all. It’s not a fight to the death. LLMs don’t conflict with what I want to do. And without the recent developments in LLMs, there might not have been the current state of computer vision. Mm. That topic you discussed — how to give AI the ability to love — did you reach any conclusions?

The conclusion is that this is very important. Why? Because without it, we face a very uncertain and very dangerous future. But with love comes hate. They’re two sides of the same coin. It can’t only have love. When it learns to love, it will definitely know what the opposite is. For me,

I completely agree with you. Mm. This becomes a philosophical proposition. Mm. But let me ask a counter-question. Why do people trust their own children, trust humans so much, but have such worry and fear about AI, this new form of intelligent entity? I don’t have an answer to that. But I think

There will be technical ways to have control. We can use technical means to make AI more trustworthy in the future, safer, and more controllable. Mm. Controllable. And this is also one reason why we need to work on world models. Why did he want to reach out to you? Uh, I don’t know.

Maybe he reached out to a thousand people, ten thousand people. I guess. Right. When we were waiting in line at a restaurant that day, we actually walked through the streets of New York together, and our conversation naturally extended to people who have greatly influenced you. In what you shared just now, the human factor

takes up a very large share of many of your choices. Why are people so important to you? And in your personal bio, you clearly listed which collaborators are important to you. That’s very rare. Why are people so crucial to you? Is this unusual? I don’t think it’s unusual at all. I think In academic circles,

this is a common behavioral pattern. People organize themselves into these social networks. Mm. And these people shape your thinking, because they may be your students, they may be your teachers, right? But teachers don’t always teach students. Sometimes students teach the teachers. All of this can be true. So it’s a huge graph

where everyone is connected. And I think That’s also why research, or science, is especially fascinating. Mm. Because many times the mutual trust between people, mutual appreciation, mutual feelings — these aren’t built through living together and being friends. Many times it’s through scientific discovery,

kind of this research aspect, that connections are built. Relationships between people. I think this is actually very interesting. For example, those who deeply influenced me — I may get to know them personally, of course I try to get to know them personally, right, but that’s not what matters most to me. I seem to understand them through their papers,

learning their way of thinking. And I think that’s the real meaning of research. I don’t think the purpose of research is to publish papers. I don’t think, uh, publishing papers is the goal. Not at all. The purpose should be — what is the purpose? Ah,

Is it a journey through people? Kaiming told me what the purpose is: Mm. At its core, it is sharing knowledge. That is, the purpose of publishing a paper isn’t just for others to see it, but so that after others see the paper, they have something to work on. That is, you publish a paper, others understand some of the content,

and they feel their own horizons have expanded. Mm. It’s about helping others. Being helpful to others. Right. Being able to inspire others, or enlighten others. Oh, that’s the purpose of research. I think that’s the purpose of research. Or, to put it more romantically, the idea is I think this — this comes from Hannah Arendt (political philosopher),

and she said she doesn’t care about impact. She doesn’t care about influence. Because In researcher circles, people say we publish papers to create some kind of impact, Right? In my own dictionary, I actually have a bit of an aversion to the word impact. Aversion. A bit of an aversion.

Oh. Uh, why? What is it about it that you resist? Again, Arendt said that she felt, uh, the word “impact” is overly aggressive, overly masculine. For her, the purpose of doing these things is not to create impact but for understanding itself. If you can understand something,

the feeling is wonderful. If you can write down what you’ve understood, whether it’s an article or a paper, and spread it, then you can potentially allow more people in the world to understand such a question in the same way you do. And this will be transmitted step by step, creating a kind of resonance.

and Arendt’s view is that she would find in this a sense of family — a feeling of family. She would feel that she understood something, told others, allowed others to understand, which means these people also understood her to some degree. Mm. But humans, as social beings, need to be understood.

Right. She reframed the word “influence” in a very soft way — seeking to be understood. I think so. I think so. You agree more with this view? I agree with her very much. Because I think creating impact is fine in itself. But it’s very self-centered. Mm-hmm.

I’m going to create impact. Mm. Right. Me-centered. And yes, you’re absolutely right. I’m going to create this impact, I’m going to change the world, but do the people in this world agree to be changed by me? [laughs] Or rather, many disasters in the world are because people want to create impact,

want to transform the world. Right. I think I would tend to agree with this softer expression. I think If all people in this world, through our research, can gain a new layer of understanding, a new layer of knowledge, the total intelligence on Earth would increase. And increasing total intelligence on Earth

is never wrong. It’s always something beneficial to the world. Whether it’s called impact or being understood by more people. Do you want to be known and remembered by more people? Mm. Do you have a need for fame? I certainly don’t have that need. You don’t have that need. But I think I don’t have that need.

But really? Uh — or rather, from where I stand now, I’m actually a victim of a kind of false fame. Uh, the reason is people now take some of our papers and post them on Xiaohongshu to discuss, talking about the so-called top-three conferences and promoting the work, right?

I I have never once asked any such media outlet to do this kind of promotion. Mm. And I tell my students: please don’t go on Xiaohongshu or Zhihu to promote your own work. You can explain your work, you can comment on your work. That’s fine. Just don’t promote yourself.

Why is it okay on X? I think on X, uh, it’s more about how you define promotion. What I focus on is briefly summarizing things and telling people what it’s about. It’s more like attracting people to look at my work, and I think that’s fine. But the promotion I’m referring to is more like the fame you mentioned,

because what I really can’t accept is people now say “so-and-so’s team” published such-and-such work. Oh. It reinforces that one person — “someone’s team” reinforces that person. Right, uh, if any editors hear this, I hope people can stop doing this.

Don’t write “Xie Saining’s team”. Don’t put my photo on it. Don’t put my name on it. We need to encourage young people more — the people who actually did the work, give them more visibility. Right? Well, people might think you’re the first author. Uh, right. If I am the first author, that’s fine.

But I’m not the first author. Right? I’m just the team lead. And much of this work is done by students. So what should it be called? Not “Xie Saining’s team”. Just focus on the work itself. Talk about what problem this solves and why it matters. That’s enough. Right.

But I think You really hate being used as a target by others. Is that so? Uh, yes. Because I think it adds a lot of risk. I think Mm. Tell us about those who influenced you. We’ve already talked about a few people. Kaiming, Professor Tu — anyone else? Oh, yes. Uh, I think, right,

this goes back to FAIR. We can follow the FAIR thread. After FAIR, I came to NYU. I think this was another decision-making point. Stayed at FAIR for 4 years. A full 4 years. Right. OK. Yes. Yes. Also with ups and downs. For me, I just said many places I’ve been

actually grew alongside me. FAIR might be an exception. When I joined, it was at its peak. The high point. Probably the high point. Right. And then Right. It’s a pity. What’s happening there now. But I also think Mm. Right. Because I left relatively early, so I wasn’t there

when it was at its lowest point. Right. [laughs] I also saw some warning signs. Right. OK. And I think, if I’m talking about people who influenced me, then in this process, going to NYU — I think that was another quite mysterious decision-making process.

Right. Deciding to go to New York at that time — I just mentioned this — was partly because I might enjoy the city. and But I think Uh, another very important thing was also that Yann LeCun is here. Right, Yann is here. Mm, right, uh. Why, with him here, were you willing to go? You worked together at FAIR.

Uh, He likes to say he’s recruited me that is, three times, right? The first time was at FAIR. But at that time, because he was the overall director of FAIR, FAIR’s director, I didn’t directly work with him, but I was influenced by him of course. Or have you had long-term exchanges?

Yes, we’ve talked. Right. But never directly collaborated. Mm. Then going to NYU was the second time. We can talk about the third time later. Mm. And the NYU experience — I think why it matters that he’s here is also because I think he’s a person with a very strong vision.

so Right. I think many of these decisions were very intuitive. For example, NYU’s building, which we call the Center for Data Science, the so-called Data Science Center, this was actually led by Yann over ten years ago. He established this organization. Right. It’s independent of traditional computer science departments

or math departments. It’s a new department. So we have a new building, and the first time I walked into this building, I felt great. Because Everything is glass doors. Right. I can take you to see it sometime. All glass doors. Uh, everything is very, very open. And it feels a bit like a company for students.

And the color scheme is very nice. Right, I keep saying I’m a visual person. There are warm tones in there, with an orange floor, various sofas, and everyone, uh, though it’s quite chaotic — all kinds of robots running around on the floor, various students on this sofa, that sofa, sitting and studying.

And there’s absolutely no privacy — zero privacy. All the professors’ office glass doors — you can see clearly everything happening inside. Mm. Right. But I thought, wow, this is very interesting. This environment is very interesting. Right. More and more American schools now are making efforts like this,

saying we want to have mm, this kind of uh, interdisciplinary cross-disciplinary centers. Right? Usually, like, these AI centers, and using them to attract talent, using them to bring different departments together, because AI really serves as this middle layer, this connecting identity and position.

Connecting everyone. Connecting everyone. Everyone needs it. Right. Mm. Yeah. Whether you’re doing science, right, doing physics, chemistry, math, statistics, business school, and including computer science, I think AI is a very good middle connecting node. Mm, right.

But Yann’s foresight was that he more than ten years ago had already established this. Mm. So I think I think he is quite a visionary person. Mm. Right. And then So NYU’s positioning in AI is also very good. So actually, uh, again, I think the computer science department isn’t the school’s strong suit.

But it has many AI talent reserves. Right. It has gathered many very impressive AI faculty members. Right. Mm. Yann is one reason you chose NYU. There are also many, many reasons. He’s one of them. Because he needed to interview me, and he had the final say. Right. Mm.

Or rather, it was he who chose me. Mm. Important people. Are there others? Mm. I think there are. For example, during my time at NYU, I also collaborated with many other professors, and one person who I think influenced me greatly would be Professor Fei-Fei. Right. I think Professor Li Fei-Fei —

uh, everyone should definitely read the book she wrote. Right, her autobiography. Right. And I’ve read it too. But after having deep conversations with her, I gained even more. Right. Sometimes I would tell her I was facing this difficulty and challenge, and Professor Fei-Fei would tell me earnestly

some stories from her past. Mm. And then This was actually a great comfort to me. What kind of stories? Specific things might not be appropriate to share. But in short, her journey wasn’t smooth sailing at all. Mm. She also had to wade through many thorns, overcoming many obstacles step by step,

and now standing on the world stage, becoming a pride of the Chinese community, or becoming a North Star for the entire research field, especially computer vision, allowing everyone to see what she’s thinking and being able to in some sense set some new directions. I think Right, her influence on me has been enormous.

Mm. and And I think Professor Fei-Fei’s greatest strength is that she’s someone who can define problems. Mm. This point is actually not very intuitive. When people talk about Professor Fei-Fei, her greatest achievement is building ImageNet, this dataset. But in fact, this isn’t just a dataset.

This isn’t just data. It’s hard to imagine that back then, right, around 2012 or 2011, image classification wasn’t a well-defined problem. Defining this problem clearly was far more important than building such a dataset — far, far more important. Mm-hmm. And I think Professor Fei-Fei

set this agenda, defined this problem clearly, so that subsequently Deep Learning could have a playground, have such a platform to showcase its capabilities. I think This is her greatest achievement, and also what I always want to learn from. Mm. Right. So I worked with her on two pieces of work.

One is Thinking in Space, and this paper mainly involves, within multimodal base models, how to better solve this kind of, uh, spatial intelligence problem. Well, recently we have another paper called Cambrian-S, and this paper also addresses questions about video — how do we define problems,

which problems are actually important. Right. I think this collaboration with her has also helped expand the boundaries of my research. How did you come to know Professor Fei-Fei well? Uh, it was all quite serendipitous. She came to New York on a business trip once, and we had a meal together. And she told me a lot of things.

Right. And she would often come to New York later, and because she’s also starting a company, we would often get together and chat. Right, roughly that. And normally we’d have some research meetings. Mm. I’m curious about something, and I think many people are curious about this too. Mm. How did you go from being a very young

researcher just starting out in academia, and gradually come to stand alongside these well-known names in AI? That is, how did you enter the core of AI? I still don’t feel I’m at the core of AI, or that I’ve gotten close to it. Mm. But the people you just mentioned,

certainly many people would love to collaborate with them. Is that so? Ah, of course. Right. I think And look — all of it was serendipity. With Kaiming it was just happening to be there as an intern and getting him to open up. And with Professor Fei-Fei, you just had one meal together. How did you get them to open up to you?

I think this is very hard to do intentionally. Mm. Or this is a bit mysterious. You could call it some kind of law of attraction. Or you could think of it as people whose thoughts align ultimately converging together. Though you may have countless small streams, in the end, they may all converge into one river. I think, for example,

uh, all the people I’ve mentioned, at least they’re all working on vision. Or rather, Even including Yann, who can be seen as doing general AI, but his starting point, right, was also digit recognition, which is also a visual problem. Right. I think everyone’s foundation is very, very aligned.

So I think I really didn’t make these things happen intentionally. Right. And many things, Or rather, I think don’t need to be made to happen intentionally. Everyone is just based on these research questions, and their understanding of these questions, collaborating together. Right. I would think of it this way.

The thing is that from the outside, I’d see you as someone very goal-oriented and very logical. But through our conversation just now, I find you’re someone whose choices are quite disorderly. Right? Right. I think there’s a certain disorder. Mm. But I think this is also a by-design process.

I choose this disorder. I think, to use this clichéd phrase: “follow your heart.” Right. But in many cases, right, there’s no way around it. Many of my choices couldn’t truly be optimized for a result. I think this is the source of the disorder. So in these disorderly choices,

can you string together all of your research journey into a single thread? We’ve actually already discussed a few works. Yes. Yes, right. I think we can go through it bit by bit. One benefit is I don’t have that many papers, so maybe it’s relatively easy to string together. And I think indeed, uh,

I can’t say there’s a hidden thread, but there really is a thread in the background guiding me to keep doing this. Or rather, before talking about these papers, I want to say: computer vision has developed for such a long time, right, and I have many friends who are slowly exploring new directions, like doing some

robotics, right, or 3D vision. I’m also trying to expand my boundaries outward. But looking back, I find on this main thread, right, I think this main thread for me — representation learning — Mm. there are too many unsolved problems. Right. So I want to stay on this main thread and push forward what we’re doing.

So the starting point of all this, if we trace it back, of course involves deep learning, deep neural networks, and the design of these architectures. I think this part is of course related to representation learning. Mm. And this is also what I think everyone, in the past, has been working toward.

Not just me. Right. Everyone is doing this: how to design a better architecture so we can learn better representations and better solve problems. Mm. Right. And then, later on, things start to change. We find that architecture itself isn’t necessarily the most important.

It’s definitely important, but not necessarily the most important; it’s not everything. So there are at least several different things that intertwine. Right, your architecture is one thing, and your data is also important. Mm-hmm. And there’s also your objective; your goal is also very important.

Right? I think architecture determines what you use for training. We can imagine it as having a massive engine. And the hardware of this engine is essentially the architecture of a neural network. Mm. But having just the engine’s architecture is actually useless. You have no fuel. You can’t start it.

Right. So, uh, there’s the data dimension and there’s the objective dimension, the objective-function considerations. And so my subsequent research has also followed this main thread, with representation learning as the main thread, advancing around architecture, data, and objective. Mm-hmm. And, uh,

during my time at FAIR, in that full-time role, I think one core aspect was that I worked with Kaiming, who was leading some self-supervised learning work. Right. And actually, again, now everyone says scaling is

already a buzzword. Everybody’s talking about scaling. Mm. Right. But actually the first person who really told me that we need a scalable model, that we need to make the model bigger and bigger, these were Kaiming’s exact words. Bigger and bigger. Right, yes. Kaiming told me this. What year did he tell you?

Uh, roughly around 2018 or 2019. Right. And so from the very beginning his conviction was that we must make models bigger, make data bigger, and this would allow us to get a better result. I think very early on, Kaiming already had this vision. Mm. And then so we also

made some efforts along this path. And I think initially, in the discussion about self-supervised learning, Yann was a big advocate; he is very invested in self-supervised learning. He has this classic cake analogy. This metaphor. Right, the base layer is

the body of the cake, and this part must be Self-Supervised Learning. On top of that you have Supervised Learning, right, the icing on the cake. And further on top is Reinforcement Learning, just a little cherry at the very top. Mm. Each layer of this cake is actually important,

but they’re not ranked by importance. Mm. If you don’t have the cake’s base, you can’t get to intelligence relying only on the cherry on top. Mm. Right. So because we were at FAIR doing vision, we were actually paying attention to this very early. But the process of this research went like this:

around 2015 and 2016, people already knew that self-supervised learning was a future for vision. So at that time, people would design all kinds of what we call pretext tasks, proxy tasks with proxy objectives. So, what is self-supervised learning?

I don’t have a label to give you directly, unlike ImageNet, where I have 1000 classes and can directly train a supervised classifier and get a representation that way. In the old days, this is what everyone was doing: through 1000 class labels. By the way, within these 1000 classes there are 200 different dog breeds.

Even so, this is why ImageNet is so powerful. Right? Even with that distribution, it can still let our neural networks learn good representations. I think this is extremely impressive. But people also see the limitations. Once everything is just Supervised Learning, there are many things you can’t capture.

Mm. Because what it learns — for example, we’re sitting here now, we see these chairs, Right? and we now have a lot of images, of different chairs. Some chairs might be quite ordinary, chairs in a studio like ours, or chairs in a home, or some designer chairs, right, or like an avocado chair,

a chair shaped like an avocado. For supervised learning, you need to map all of this to a single label called “chair”. So this mapping your network has to learn is actually very, very difficult. Right. And it’s an infinite mapping. It’s an infinite mapping.

Mm. So it can only either memorize, just recite all the chairs it’s ever seen, or else rely on what we call spurious correlations, false correlations, to tell you it’s a chair. For example, it may not look at the chair itself but at the background behind the chair,

or it thinks all chairs will be next to a table, so it uses that to make a decision boundary and says, hey, this is a chair. But this is not what we want. What we want to achieve is, from this very diverse visual knowledge, these visual observations, to gain some kind of common sense, some kind of intuition.

Mm. Intuition. Right. Or some kind of common understanding. So this is why people initially wanted to do so-called Self-Supervised Learning or Unsupervised Learning. A common misconception back then was that we want to do Unsupervised Learning because labeling data is too hard and too expensive: we need to hire people

to label, spending money and time, and we don’t want to do that. But that’s just one very small part of the problem. The bigger issue is that, in the eyes of computer vision researchers, everyone knew long ago that through this path alone there’s no way to give AI systems this kind of common sense. So in 2015 and 2016,

everyone was very, very creative. That period was actually a quite creative era. People would design all kinds of crazy tasks. These tasks — for example, you take an image, rotate it 90 degrees, or 180 degrees, or 270 degrees. You don’t give these images a label, but because you designed

how to rotate these images, right, these images and their rotation angles form a valid pretext task: you can predict how each rotated image was actually rotated. This becomes a so-called proxy task. Mm. Similar proxy tasks also include taking an image, converting it to grayscale,

removing all its colors, but then using a neural network to reconstruct the original colors. Essentially, from a grayscale image, how do you predict the color each object should have? Mm. And there are other similar examples, too many to count. Let me give one last one:

The so-called Context Encoder — you take an image, cut out a piece in the middle, make it white, and then train a neural network to fill in this empty part. Fill it in. Mm. The rationale behind all these pretext tasks is that humans can actually do this. The reason humans can do this, the reason humans know,

hey, whether this image was rotated 90 or 180 degrees, or what color the butterfly or house in the image should be, or how to predict the information missing in the middle — all of this is because humans, based on some understanding of the physical world, have this common sense,

so they can guess how these corrupted signals, these lost signals, should be reconstructed. The masked signals. Right. But the problem back then was a hundred flowers blooming: all kinds of papers, Mm, but none of them worked well. All the results were actually quite poor, all worse than ImageNet pre-training,

by roughly 15-20 percentage points. Percentage points. So people were making some progress, moving forward step by step, but the representation ImageNet could learn through supervised learning, on large-scale data with labels, was still far, far better.
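The rotation pretext task described above fits in a few lines; here is a minimal, illustrative NumPy sketch (the function and variable names are my own, not from any specific paper):

```python
import numpy as np

def make_rotation_batch(images, rng):
    """Rotation-prediction pretext task: rotate each image by a random
    multiple of 90 degrees; the rotation index itself is the free label."""
    rotated, labels = [], []
    for img in images:
        k = int(rng.integers(0, 4))      # 0 -> 0, 1 -> 90, 2 -> 180, 3 -> 270 degrees
        rotated.append(np.rot90(img, k=k))
        labels.append(k)                 # no human annotation needed
    return np.stack(rotated), np.array(labels)

rng = np.random.default_rng(0)
images = rng.random((8, 32, 32))         # toy batch of 8 grayscale 32x32 "images"
x, y = make_rotation_batch(images, rng)  # x: rotated images, y: rotation labels
```

A real pretext model would then train a classifier to predict `y` from `x`; the point is the features it learns along the way, not the rotation accuracy itself.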

Right? So, uh, we did something at that time, together with Kaiming. And this architecture is called MoCo, Mm, Momentum Contrast, momentum contrastive learning. Right. Even the Chinese name sounds interesting. Right, yes.

Yes, momentum contrastive learning. Uh, I think we don’t need to dig into the specific technical details, because much of that is no longer important now. But in short, it was the first paper to take what’s called contrastive learning as a framework and make it actually work. And what is contrastive learning?

Also quite simple. We’re now in this Representation Space, in this representation space, there are different points. These points may be the same object or completely different objects. For example, I have several images of this chair, Right? and also some that may be tables, or images of cats or dogs.

These images are all different, but in this space, we can measure their distances. Or we know all these different chairs — their images should be closer, their representations should be closer. But a chair and a cat should be farther apart. Mm-hmm. So this is the basic logic of contrastive learning.
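The “closer / farther” logic above can be written down as an InfoNCE-style contrastive objective, the family that MoCo builds on. A minimal NumPy sketch, with illustrative names and toy vectors (not MoCo’s actual implementation, which adds the momentum encoder and a queue of negatives):

```python
import numpy as np

def info_nce(query, positive, negatives, temperature=0.07):
    """Contrastive (InfoNCE-style) loss: the query should land close to its
    positive and far from every negative in the representation space."""
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    q, pos, negs = unit(query), unit(positive), unit(negatives)
    # Cosine similarities: index 0 is the positive, the rest are negatives.
    logits = np.concatenate([[q @ pos], negs @ q]) / temperature
    logits -= logits.max()                          # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                        # cross-entropy on index 0

rng = np.random.default_rng(0)
q = rng.normal(size=128)
negatives = rng.normal(size=(16, 128))
close_pos = q + 0.05 * rng.normal(size=128)         # a slightly perturbed "view" of q
loss_close = info_nce(q, close_pos, negatives)      # positive near the query
loss_far = info_nce(q, rng.normal(size=128), negatives)  # unrelated positive
```

With these toy vectors, `loss_close` comes out much smaller than `loss_far`: the loss is low exactly when same-thing representations sit close together and different things sit apart.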

And this is actually not new. It’s been done for many, many years. By the way, the early work here was actually first done by Yann together with his students. That’s very interesting. Of course the problem being solved wasn’t directly Representation Learning, but some Metric Learning problems.

Some metric learning problems. But that’s okay. This was around 2019; I think we gave contrastive learning some new meaning. But this didn’t come out of nowhere. Before that, the entire field was slowly expanding in this direction. For example, there was a paper called CPC,

and another paper called Memory Bank. These two papers were already moving in this direction, using contrastive learning to do self-supervised learning, and had already taken several steps. Right, and this is where I can’t help but admire Kaiming’s ability. I think this is also a moment that made me think, wow,

what a top-tier researcher — or rather, I can’t just say top-tier researcher; Kaiming, in my heart, is simply the best researcher. How does he actually work day-to-day? Mm, okay. I think there are several points; maybe we can briefly talk about it. I think he has a kind of extreme focus.

This focus allows him to enter a kind of flow state, right; he can immerse himself in a problem without needing to consider what’s happening in the rest of the world. Mm. And I find this particularly admirable. And another thing: how does his focus manifest?

I think his focus shows in that, Mm, every day, apart from this one problem, he won’t think about anything else. He’ll grab his collaborators to talk about it, and grab other people to talk about it too. In any case, this topic is the main subject of his thinking. Oh. And most of his mental cycles

are allocated to this one specific problem. Oh. This is very difficult. I think it’s extremely, extremely hard. Right, because thoughts are often very hard to control. Yes, yes, yes. Ah right. This is related to world models. Thoughts are hard to control. That’s a good point.

But Kaiming is actually someone very capable of this kind of focused decision-making, able to concentrate. Mm. I actually think there are several points. I think a top researcher needs this ability to varying degrees. They need sufficient focus, they need good research taste. How do you define that?

We can talk about it later. Mm. And they also need a certain steadfastness — you can’t just go with the flow and do what others are interested in. And of course you also need strong engineering skills, research intuition, including when you read literature, you know what’s important

and what’s not. This is very important. And this is actually something quite odd about academia: you have to be able to extract the key points yourself. Right. The main reason is that people often don’t state them clearly. You know? Sometimes people simply can’t articulate the key points,

sometimes people are unwilling to state them, and sometimes people haven’t realized what the key points are. But Kaiming’s ability is that he can peel away the layers, extract these key points, tell you, and establish these connections in this high-dimensional abstract space. These connections. Oh.

I find this extremely, extremely impressive. Right. So Kaiming’s ideas often didn’t come from sitting in some corner, dreaming them up at home. They actually come from constant exploration, extensive reading, extensive thinking, derived little by little. And this

truly, deeply influenced the way I do research, and what I now tell my students about how research should be done. It’s about increasing input. Increasing input. And I think there’s actually a paradigm here. Mm, and this paradigm is also something Kaiming taught me.

Right, he said you can’t just sit there and think up all these ideas, because if you come up with an idea, Mm, by just thinking, it’s definitely not a good idea. There are really only a few possibilities. The first possibility: you’re smarter than everyone else in the world, so you come up with an incredibly brilliant idea

that no one else can think of. But I think the probability of this is extremely small. So there are two more likely possibilities. First: while you’re thinking of this idea, 100, 1,000, 10,000 people in the world are thinking the same idea. So you’ll have to compete with them, and your execution speed may not be faster than theirs.

The second possibility: it’s a very bad idea that others have already tried many times unsuccessfully. Unsuccessfully. Mm. Then you probably don’t need to try it either. Mm. So I think Kaiming’s greatest influence on me is that he taught me how to find a research idea. Mm. How? I think this is a process of seeking.

So now, when new students come in, I tell everyone about a research cycle. Uh, of course I hope it could be longer, but in today’s competitive environment, it might be at most 6 months. That is, from the start of those 6 months, you need to begin thinking about an idea, and then later

you need to write this idea into a paper and publish it. This whole cycle is about 6 months. What does this process look like? You need to have a general direction; you need to know what you want to do. You can’t know nothing at all; just saying “I want to do research” isn’t enough. This can be achieved by talking with your advisor,

or with your peers, discussing with your classmates, or through your own reading, developing some general direction, this directional understanding. Mm, right? But you must give yourself enough time and space to explore. And this exploration, this exploration phase, I think

should last at least one to two months. What should you do during the exploration phase? Good question. What do you do during exploration? You can’t just sit there thinking. Exploration means constantly hacking on things; you really have to be like a hacker, playing with things,

messing around with things. Treat research like a game, like a toy to play with. Mm, this might involve, for example, working through formulas, reading more papers, finding some connections, of course, and perhaps more importantly, actually doing things, writing code. But when you’re writing code,

what you need to note is that the code you write is not your initial starting idea or direction, but an exploration process. So the code you write might simply reproduce a baseline, taking what someone else’s paper does and reproducing it. And it might also build on this baseline

to make some kind of extension. Mm. And the most important thing in all this is to find a signal. It’s still a bit like what you just said: all of this decision-making is actually a quite disorderly exploration process. It’s what we call stochastic gradient descent. Right?
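The stochastic-gradient-descent analogy can be made literal with a toy sketch: every individual step follows only a noisy signal, yet the iterates still drift toward the goal. All names and numbers here are arbitrary illustrations:

```python
import numpy as np

def noisy_descent(grad, x0, steps=500, lr=0.1, noise=0.5, seed=0):
    """Follow a noisy gradient estimate; no single step is reliable,
    but on average the iterates move toward the minimum."""
    rng = np.random.default_rng(seed)
    x = x0
    for _ in range(steps):
        x -= lr * (grad(x) + noise * rng.normal())  # true signal plus noise
    return x

# Minimize f(x) = (x - 3)^2, whose exact gradient is 2 * (x - 3).
x_final = noisy_descent(lambda x: 2 * (x - 3), x0=-10.0)
```

Starting far away at -10, `x_final` ends up near the minimum at 3 despite the noise, which is the point of the metaphor: you don't need the right direction at every step, only a usable gradient on average.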

This is a cornerstone of all machine learning, but it applies equally to research itself and to our lives. That is, in everyone’s pursuit of their ultimate goal, they’re all going through a stochastic gradient descent process. Mm. And I think research is the same. For you, the most important thing in research

is not going from point A to point B. For example, A is an idea, B is a paper, but rather in this process, what kind of signal can you find? Your gradient, where exactly is your gradient? Right. So Kaiming’s view is this gradient itself is the source of your real idea. When you’ve gone through constant exploration,

tried many things, possibly unsuccessful, possibly successful. By the way, it doesn’t have to be a successful experiment to give you this gradient. Sometimes a failed experiment gives you a larger gradient. Right? The most feared thing is not knowing which direction to go. Mm.

So a good result and a bad result are both good results. For research, a surprise, something surprising, such an observation, is always the most joyful thing for a researcher. Something unexpected that you observed. Right. You saw something unexpected.

Mm. So he said, it’s in this kind of exploration, in this process, that the ideas you discover are truly your own ideas. The idea you started with isn’t your idea; that thing doesn’t belong to you. The idea found in exploration is your own idea. And the research process is about finding

your own idea. And this thing truly belongs to you, as if heaven gave you an inspiration and injected it into your head. Right, on one hand heaven gives you inspiration; on the other hand, it’s also based on extensive empirical work and practice. Right?

There’s no free lunch here. Maybe you’re truly a genius, or maybe you’re extremely lucky and God held your hand as you wrote the formula. It can happen. But most of the time, most progress, even most work with great influence on the field, I think still happens step by step. You can always trace back

to find its starting point. So I also tell students what’s actually the worst kind of research? It’s when you define a problem at the start, say this is my idea, and in the end publish a paper whose idea is exactly the same as what you started with. You didn’t encounter any obstacles, you didn’t encounter any difficulties.

Why is it the worst? Because this shows your idea is a boring idea, and your published paper is a boring paper. Right. I think after many years of observation, this is indeed very, very accurate. So I think this is also why I tell students this — because people sometimes can’t accept this fact.

People always think: I should start by thinking of a clever trick, then implement it, make it work, publish a paper, I’ve succeeded, and I move on to the next thing. But what this gives you in personal accumulation is actually very, very limited. The exploration process is actually very difficult. Many people don’t know how to explore.

Exploration is very hard. And this is why all these papers, in my view, are nonlinear. This nonlinearity shows in two aspects. The first is your 6 months of time: by the 5th month, like I just told you, your mindset collapses. With the ResNeXt story, on one hand people hear, wow, you changed direction in the last month

and made it work. That time period is so short, and you still managed to do it. It sounds unbelievable. But once you know this happens too often, you find there really is a pattern. You often go through this. I often go through this. Or rather, my best work always happens this way. So how do you maintain your mindset for the first 5 months?

Uh, there’s no way around it. You have to accept this fact, you have to be able to tell yourself this is a normal research process. Would you consider switching direction in the first 5 months? I might go for that boring idea. I think you would. And changing direction is actually very, very important. You must learn to pivot.

Because I just said, the worst work is when your starting idea is the same idea as your ending idea. The best work is when you’ve gone all around, jumping here and there, taken a long, winding road, and only then arrived at this point. Mm. Though this road is very bumpy, from the final destination

step by step you can always trace back to the very beginning. Only then can it be connected into a line. But during the process, you can’t. Yes, during the process you’re in the middle of it; you don’t know, you can’t predict the future. So this is always an exploration process.

So I think about two months of exploration, gradually forming an idea, then gradually expanding, then scaling up, Right? then supplementing experiments sufficiently, which might take another two to three months, and finally writing the paper, spending one to two months. This is

already a very smooth research process. Mm. And I think this, again, in today’s era, faces many, many challenges. People face all kinds of pressure. Right? The competitive pressure now is too great. The competitive pressure is too great. And I think it makes people feel

they must chase the cutting edge, finish things as soon as possible, seize the opportunity. Mm. Claim the territory. But looking back, I think, as I just said, Professor Fei-Fei’s greatest strength is that she’s someone who can define problems. If you lose the ability to define problems,

you essentially also lose much of the ability to innovate, essentially lose the ability to do research. I just said research is nonlinear in terms of time. But in terms of results, it’s also nonlinear. Mm. This actually comes from MIT professor Bill Freeman, who has a very classic

plot, an illustration, this kind of graphic. He often talks about it when giving talks. This graphic has a horizontal axis and a vertical axis. The horizontal axis runs from a very poor work, to a decent work, to a very good work, to an exceptionally impressive work. That’s the horizontal axis.

The vertical axis is the impact on your entire career, the impact of this paper on your career. So you can guess what this curve actually looks like. Right? It’s not a linear curve. It’s not that a very poor work has a very bad career impact, and the best work or a fairly good work

gives you a very good return, gradually increasing. It’s not linear. It’s not linear. It’s saying basically, a very poor work actually won’t hurt you much, nobody cares. Mm. No one will notice. A decent work — no one notices either. The gains it brings you are also small.

Mm. But sometimes, when you produce a very good piece of work, an exceptionally impressive work, work that everyone knows about, your impact (I said I don’t like the word impact) immediately shoots up to the top. Right?

So we often say that what people measure in academia is the so-called signature work. Or another way to put it: what you optimize for is not the average of all your previous work, but the maximum of your work. Right, the highest point.
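The max-versus-average framing is easy to state concretely. A toy illustration (the scores and variable names are invented for the example):

```python
from statistics import mean

def career_score(paper_impacts):
    """In the signature-work view, a career is measured by the best
    single piece of work, not by the average over all of them."""
    return max(paper_impacts)

steady = [6, 6, 6, 6, 6]   # consistently decent papers (mean 6)
spiky = [1, 1, 1, 20, 1]   # mostly duds plus one standout (mean 4.8)

# By the average, `steady` wins; by the max-based measure, `spiky` wins.
```

This is exactly the nonlinearity in Freeman's curve: the mediocre papers barely register either way, and the one standout dominates the outcome.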

I think this illustrates the nonlinear character of the research game. Mm. So is the highest point good or not? Of course it’s good! You only need to succeed once in your lifetime. I actually gave a talk about this at CVPR; I called it Research: The Infinite Game.

Mm, right? This got quite a strong response. I actually rarely give these non-technical talks, because they’re more about philosophical thinking and some summaries. That one was actually quite good. But it also contained everything I talked about above.

Because think about it: research as a career, a researcher as a profession, what is its true essence? Oh. It’s not being a chess player; it’s not even being a Winter Olympics athlete. Because for a chess player and an athlete, your final achievement depends on your worst step

to some extent. You have to ensure every step; your moves must be correct. If you make even a small mistake in the middle, place a piece wrong once in chess, you’ve lost. You’ve lost. Right? So this is a finite game. In this process, there are always winners

and always losers. But a researcher is more like an inventor: in your lifetime you truly only need to succeed once. Mm. If you’re lucky enough, you can succeed a few times. Twice maybe. But you don’t need to succeed 100 times. Two times gets you to the top? I think so. Oh.

So I think this is actually quite interesting. And I think as the entire field moves forward, there needs to be some reflection. The traditional academic world, whether in its social responsibility or its positioning in the entire research landscape,

was always the one setting the rules of the game, always the one deciding where we go next. Right? Now it’s completely different. Now the ones deciding where things go are OpenAI, maybe Google, or Meta, or other major companies. Right, and they’re playing a finite game against each other.

But this has dragged academia into their finite game, into this kind of decision-making chain. Right? So you see, many times when a major company releases something, whether it’s some o-series, or some GPT series, or the Nano Banana series, a specific piece of work, a product launch,

immediately everyone in academia swarms in, asking: how can we, within this paradigm, using what you’d call peanuts of resources, Mm, try to chase it? Oh, chasing. What’s the point? Reproduce, right? Or maybe people don’t believe they can

catch up. Right, as you said, they probably can’t catch up anyway. So it becomes some kind of reproduction, or building on top of it. I think this kind of research process is actually very, very painful. Because there’s one more thing I haven’t mentioned. For the past two years at NYU,

I’ve actually also been working part-time at Google. Mm. Part-time. And this was in the Nano Banana team, the team within GenAI. This went on for two years. So, not sure if I should share this, but let’s share. Sometimes I tell some friends,

the reason I went to do this work at Google is I wanted to see what people at Google were doing, so I would know what not to do in academia. Oh. That is, I need to know what you’re doing, so I know what not to do. Because if I know you’re doing this, why would I do it alongside you? Makes sense. Because they have more resources.

it has more resources. No need to compete with them. Yes, yes, yes. So this is also something that guides us. Right, I don’t want to be too preachy. By the way, a disclaimer: all of this is based only on my experience at NYU, not particularly successful, just sharing some experience. It doesn’t represent the diversity

and complexity of research worldwide. And there are some papers I do want to share with everyone, but looking back, I haven’t produced a paper that I truly think has real value. You’re saying this to tell everyone you haven’t reached the highest point yet, haven’t reached that Max yet.

You’re right. I’m still young. [laughs] I can still work harder. Mm. But it really is like this. Yesterday I was thinking about this question. I think there might be about 20 such papers, twenty-something papers, that have profoundly influenced all of deep learning and the progress of AI.

If this world has 20 such papers, or 25 papers, and I don’t have a single one, what reason do I have not to keep working hard, to keep going? I think this is a goal. Doesn’t DiT count? Uh, I think it counts as 0.25. Or rather, DiT is more like pushing along the tangent of the research frontier,

taking a small step forward. If we didn’t do it, someone else would have. It doesn’t completely belong to you. Right, it doesn’t completely belong to me. Mm. You’re right. Yes. But, or rather, I think the Diffusion Model certainly counts,

including maybe DDPM. Right. I don’t know; maybe we can list some. I think this might be quite interesting. I think LeNet counts. I might not be able to list them all. Okay, let’s just list some. Papers that have influenced AI’s progress, right? Right.

Or rather, I think in my view, these are things that can truly be called signature works, Or rather, works that I’m still very far from. Right? I think ah, LeNet of course counts. AlexNet of course counts. Mm, and then ImageNet of course counts. ResNet of course counts.

Mm. R-CNN or Faster R-CNN, the detection part, of course counts. Kaiming’s already on there several times. And what else? Transformer of course counts; Attention Is All You Need, of course. GPT-3 of course counts. BERT of course counts. I think CLIP counts too.

ViT, the Vision Transformer, I think counts too. And GAN, I think, counts too. Okay, can’t list them all; roughly at that level. Including in 3D: NeRF (Neural Radiance Fields) and Gaussian Splatting, I think both count. They all count. So

across different fields, they all have these works. The significance of these works is that everyone was originally moving gradually in a direction, and then suddenly a paper like this appears out of nowhere, completely changing the stochastic gradient descent process we just mentioned. So you see its convergence curve

has a drop. Mm. This is how I define it. And assuming this long river of history means the curve continues forward, right, then time and again, allowing everyone to break out of previous local optima or enter the next stage,

such papers appear. But I think we’re still far from done. This path is far from convergence. There are still many things to be done. I think it doesn’t need to be me personally, but at least I hope to be able to participate. Right. I hope, assuming there’s a next revolution,

I hope, looking back, right? Maybe it’s not about creating some impact, but that because of my personal experience, the patterns of collaboration around me, my own understanding and thinking, I am able to understand certain things, and what I understand can somehow have some influence on

the world’s or AI’s development. Mm. I think this is something I care very much about now. Mm. Is there no hope of this from LLMs? The next revolution. Again, I think absolutely not. No hope? Or, I would say, LLMs will eventually fade. No no no.

LLMs will never die, but will eventually fade. Old soldiers never die, they just fade away. Right? Why will they eventually fade? They won’t die. They will just fade away. That is, it will definitely have its value, it’s a very good tool. I use LLMs every day now.

But it’s not the foundation for building a universal, a general intelligence system. It’s not the foundation on which a world model can be built. World models, we’ll talk about that later. Your work — do you want to expand on it? You’ve already let me say a bit more. Is there time?

Yes. You’ve already said you haven’t reached Max. Yes, yes, right. Put that way, it seems there’s nothing much to talk about with these works. But I think there’s still some significance. Because, just like I said about non-linear research, right, in a paper, we first do some things, then gradually

build up some reserves, and then in the last month, find a new direction, deliver the final result. Mm. I think, When I look at all my previous work, I also have this feeling: I’m still in that initial confused exploration phase. But who knows — maybe this year, maybe next year,

maybe I’ll suddenly, right, have a spiritual awakening, and can produce some more meaningful work. Mm-hmm. But I think the foundation here is, as I just said, it needs to be able to string together a thread. Or rather, it’s actually not a line, it’s a graph. It has different nodes,

different nodes connected to each other, each node is a paper, all with connections between them. Your subsequent papers are all influenced by all the previous papers. Mm, right. So later, for example, with Contrastive Learning, making it work meant that for the first time in visual tasks we saw work like MoCo,

and we had V1, V2, V3, right? And in V3, we used a Transformer and scaled it up. Uh, it was actually already better than the representations ImageNet supervision could give, across all kinds of tasks. This for us was actually a major surprise. Mm. Mm-hmm.

At that time, at that point, I thought, wow, everything is flourishing again. Our problem can basically be answered. We found a way — self-supervised learning — that can work. Going forward, we just need to scale up what we’re doing now, and the future is incredibly bright.

But unfortunately, this also didn’t happen. Right? But before that, we had another paper, also MoCo and MAE by the way were both projects Kaiming led. Actually, people say what does it mean to lead a project? I think Kaiming truly demonstrated this leadership — that is,

he truly took on 80–90% of the first-author plus last-author, or corresponding-author, responsibilities. He needed to write the baseline himself, run many, many experiments himself, finalize the paper himself, tell the story, present it, all of these things basically Kaiming did single-handedly.

And accomplished it. So what about others? Others, we of course also participated and made contributions. But I’m just saying this is a path Kaiming led. Right, we accelerated the progress of this, and may have made the results much better too. Mm. But it doesn’t change the essence of this.

Right. So this is Kaiming. Even now, for example, just a couple days ago he told me he really enjoys this kind of IC work — individual contributor, the individual contributor type of role. Mm. He doesn’t enjoy managing a large team, getting everyone together, just being a manager pointing the direction.

He doesn’t like that. How many people does he manage now? He has many, many people. He now has many undergraduates visiting him, and he is also doing a lot of really great work. So I actually don’t believe him. I tell him, “You’re actually a very good manager.” At least for me,

even though you never really managed me, just being around you, I could feel my own efficiency improving, feeling like I was getting smarter. I think if I were going to have a manager, I’d want one like that — right? one who can empower the people around him to get better. Right. I think this is Kaiming.

So MAE — in any case, we explored the Contrastive Learning path, and found it couldn’t scale up. So we wanted to switch directions. So we went back and used a simpler approach, which is a kind of denoising autoencoder, this kind of autoencoder, the Masked Autoencoder (MAE).

This method is even simpler. Everyone can go read the paper, but in short, you take some images and corrupt them, then reconstruct these noisy, cropped, or masked images to learn representations. Mm. This is fundamentally different from Contrastive Learning,

but its results were also very good, although it has very different characteristics. For example, it doesn’t explicitly model this kind of invariance, which causes it to perform slightly worse when doing linear probing, but much better under full fine-tuning; these are two different ways to evaluate representations, right.
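The corrupt-then-reconstruct recipe he is describing can be sketched in a few lines. This is a toy illustration, not the actual MAE code; the patch size, mask ratio, and function name here are my own choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(img, patch=4, mask_ratio=0.75):
    """Split a square image into patches and zero out a random subset.

    Returns the corrupted image plus a boolean mask marking which
    patches were hidden -- the model's reconstruction targets.
    """
    h, w = img.shape
    ph, pw = h // patch, w // patch
    n = ph * pw
    hidden = rng.choice(n, size=int(n * mask_ratio), replace=False)
    mask = np.zeros(n, dtype=bool)
    mask[hidden] = True
    corrupted = img.copy()
    for idx in np.flatnonzero(mask):
        r, c = divmod(idx, pw)
        corrupted[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0.0
    return corrupted, mask

img = rng.standard_normal((16, 16))      # a fake 16x16 "image"
corrupted, mask = mask_patches(img)
# an MAE-style model would now reconstruct only the hidden patches;
# that reconstruction loss is what teaches it a representation
print(mask.mean())  # 0.75 of the patches are hidden
```

The evaluation point he makes still applies to anything trained this way: you can test the learned representation either by freezing it and fitting a linear classifier (linear probing) or by fine-tuning the whole network.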

in any case, they have different properties the representations they learn also look different and these things would have far-reaching consequences down the line we can talk more about this later but this was MAE at the time we thought wow, MAE is incredible MAE should at least win a best paper award, right? turns out it didn’t

scaling up MAE would solve all problems, right? turned out it didn’t scale up either right actually I heard you and Xiangyu (chief scientist at StepFun) had talked about this before because he also paid attention to self-supervised learning he actually also talked a lot about why self-supervised learning can’t scale up

some of the reasons; I won’t go into them again here, feel free to go back and relisten to that episode but anyway, in short, back then there was this kind of rollercoaster ride on the one hand, we got some really good results but on the other hand, these papers were just papers we were never able to truly deliver something real

right, like GPT that could point everything toward a completely different scalable paradigm for the future yeah, right I think this whole thing had, at that point, kind of come to a close of course, at that time I also did some other work for example, I extended self-supervised learning for what you could call the first time

into the 3D domain, for instance I also did some work on point clouds these were called Point Contrast but these works were perhaps more about demonstrating that representation learning as a concept is not just a problem for the image domain it’s a very universal approach or rather, a methodology

it doesn’t only work with images it also works in 3D space later on many people tried it on all kinds of medical imaging and also on robotics tasks all kinds of domains it holds up so this thing I don’t see it as a failure because it really has been influencing many many different fields beyond what we were focused on

like computer vision itself but on the other hand it still hasn’t achieved the same kind of impact as LLMs in terms of influence mm so then after all that, what came next? right, yeah it seems like we went back to an exploration phase all of this was at FAIR all done at FAIR

you were there for 4 years during that phase 4 years mm so was that the end of your FAIR chapter? not yet still early, still early that was probably the first year or two right there’s another fun story, let me brag about Kaiming again [laughter] so back then, resources were always an issue

GPUs were always in short supply and then FAIR made a decision to give TPUs a try see if this thing is any good Google had been using them they had fully transitioned to using TPUs so we got about 5,000 TPU chips these chips weren’t bought, more like rented on Google Cloud and then

it was originally set up for people doing language models people played around with it and quickly found ugh, it’s way too hard to use really not user-friendly okay Kaiming stepped up and said, let me handle it so he truly, single-handedly I mean, again all on his own from start to finish

built an entire infrastructure on TPUs which enabled us to do all the subsequent work including MoCo including MAE including the later DiT all of it happened on top of TPUs so for me, this was a really important lesson which is how to summarize it… it’s like

a craftsman who wants to do good work must first sharpen his tools mm one thing Kaiming taught me was the ceiling of your research actually depends on how good your baseline is oh because if your baseline is weak you can easily fool yourself oh you won’t produce anything meaningful if you haven’t put enough thought

into the baseline level into building this system properly into pushing the engineering to its limits you don’t have a platform to do real exploration because you might find an interesting seemingly valuable signal but that signal could be completely wrong the reason being your baseline your benchmark itself wasn’t good enough

mm so this is actually quite counterintuitive because people always say if my baseline is a bit weaker then the performance gains I can show would be larger so it’s easier for me to publish papers right, but Kaiming doesn’t think this way mm he thinks about how to push the baseline as high as it can go

and then starting from that foundation whatever new things we build that’s groundbreaking work that’s a genuine breakthrough right anything you build on top of a weak baseline any improvement might just be a throwaway paper so this thing has also been an inspiration to me including when they were working on detection

I wasn’t part of that work I was still doing my PhD but all of those Fast R-CNN, Mask R-CNN Focal Loss, and the whole series of work all of that work was because they including Ross Girshick including Kaiming including Wu Yuxin who is now at Kimi they put enormous effort into building the infra

and building that codebase so that the baselines the baselines for these methods already far exceeded all of those random mediocre CVPR papers mm our baseline was already stronger than yours so if I take one more step up of course I’m going to go even further mm so I think I’ve always maintained this kind of

methodology I think I place a lot of importance on this kind of, I don’t want to call it engineering, because it’s not entirely just about the codebase itself it’s not the kind of codebase you’d build at a product company it’s more like the scaffolding for a research breakthrough

if your scaffolding is unstable you can’t build anything so this thing also influences what we do now but anyway, the point is Kaiming in terms of building this scaffolding was also truly exceptional I think you were so lucky because very early on someone told you a lot of the right ways to do things

so in many areas you avoided a lot of wrong turns I think I was incredibly lucky though I think a lot of this really is, on one hand, common sense but as you said, on the other hand for a student this might not be so obvious not so apparent mm like with this scaffolding thing

when we were at FAIR there was a running joke kind of a joke, sort of the story goes that the first lesson for everyone interning at FAIR guess what it was? mm the first lesson was to use a certain tool guess what that tool was? no idea that tool was an Excel spreadsheet

[chuckles] this thing is also quite interesting so we’d have this whole system for tracking experiments of course, this might be a bit outdated now because nowadays there might be better tools like Feishu many better tools but back then we would meticulously build this kind of template

and this template was just an Excel file so sometimes we felt like office clerks I do research every day but it’s not a screen full of code writing fancy stuff instead, I’m staring at this spreadsheet this Excel file looking at what each row represents the research part of this is how you design the spreadsheet

how do you make sure every experiment gives you what I just called this gradient right because you can always hit two extremes one extreme is you run too few experiments so your signal is unclear you don’t know anything the other extreme is I don’t care at all what experiments I’m running I just run experiments blindly

right I have all these resources I just maximize my resources run all the jobs dump all the results just throw everything into the spreadsheet and then feel satisfied thinking my research is done both of these are a pretty poor pattern for a student’s research mm but back then, by watching how Kaiming

built that kind of spreadsheet I learned an enormous amount right because you really have to make some decisions those decisions being: what metrics should I actually focus on right what should I actually be recording what columns should there be how should I define control variables and how to make each experiment as informative as possible
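a toy version of the kind of sheet he describes might look like this; the column names, baseline numbers, and the one-variable-at-a-time check are my own illustration, not FAIR’s actual template:

```python
# Each row of the "spreadsheet" is one experiment with fixed columns.
COLUMNS = ["run_id", "lr", "batch_size", "epochs", "top1_acc"]

baseline = {"run_id": "base", "lr": 0.1, "batch_size": 256,
            "epochs": 90, "top1_acc": 76.1}

def varied_factors(run, base, controls=("lr", "batch_size", "epochs")):
    """Return which controlled variables a run changes vs. the baseline.

    An informative experiment -- a clean "gradient" -- changes exactly one,
    so any difference in top1_acc can be attributed to that variable.
    """
    return [k for k in controls if run[k] != base[k]]

run = {"run_id": "lr-sweep-1", "lr": 0.3, "batch_size": 256,
       "epochs": 90, "top1_acc": 75.4}
print(varied_factors(run, baseline))  # ['lr']: a clean single-variable comparison
```

the two failure modes he mentions map directly onto this check: too few rows and you have no signal; rows that each change many variables at once and no accuracy difference is attributable to anything.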

mm okay so let’s move on right, so what other things happened at FAIR then there’s also the thing about DiT right but let’s not jump to that yet let’s continue the FAIR story so after the self-supervised learning phase you entered an exploration phase again right so at that time, like I mentioned

actually there’s no real transition right, these things are all overlapping I may be doing one thing while also exploring something else right and at that time what I was most interested in actually was generative models at the time generative models was a big topic GAN was already quite mature by then right

then VAE and various other things were also starting to emerge yes then there was a paper, back in maybe 2021 or 2022, the DDPM paper right, the Denoising Diffusion Probabilistic Model mm this paper was very interesting to me because at the time the image quality

actually wasn’t that impressive yet I think the image quality was about on par with GAN or even a bit worse, right but in terms of sample diversity it was much better than GAN right because GAN always has this mode collapse problem right, it tends to just generate one kind of image right but this thing was able to generate

much more diverse content so I thought there might be something here but it’s still not clear enough yet then we had a meeting in the group and we discussed this paper and at the time Kaiming also said he thought this was interesting he also thought this was something worth pursuing but he had one question

and this question I still remember to this day he asked, have you thought carefully about whether this is a discriminative model or a generative model? mm I think this is very profound because the essence is you’re doing denoising when you’re doing denoising essentially you’re doing discriminative prediction

right but at the same time through multiple steps of denoising you’re also doing generation right so the interesting question Kaiming raised was in the end, is this thing a discriminative model or a generative model? and what does this boundary mean? mm I thought this was a very deep question

because in the end the things that Diffusion models are capable of doing completely blurred this boundary right it can do generation, it can do discrimination it can do representation learning all kinds of things so I think this is a fairly profound question yes so at the time, based on this question we did a lot of exploration
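Kaiming’s question can be caricatured in a toy loop: each step makes a discriminative prediction (what part of x is noise?), while chaining many steps is what generates. This is not the real DDPM update rule, just a sketch of that structure with made-up names:

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(x, t, predict_noise):
    """One step is *discriminative*: predict the noise present in x."""
    eps_hat = predict_noise(x, t)
    return x - eps_hat / t  # crude move toward cleaner data

def generate(predict_noise, dim=8, steps=50):
    """Chaining many such predictions is what makes the model *generative*."""
    x = rng.standard_normal(dim)  # start from pure noise
    for t in range(steps, 0, -1):
        x = denoise_step(x, t, predict_noise)
    return x

# toy "model" that treats half of the current signal as noise
sample = generate(lambda x, t: 0.5 * x)
print(float(np.abs(sample).max()))  # shrinks well below the initial noise scale
```

the blurred boundary he describes lives in `predict_noise`: the same learned predictor is a per-step discriminative model, yet the loop around it does generation, and its internal features can serve as representations.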

including things like trying to use DDPM or diffusion models for classification and checking whether the representation it learns is good and how it compares to a self-supervised model mm this was a line of exploration we pursued it was interesting and there’s a paper I’m not sure if it was published

actually I know it was published but it wasn’t published by us someone else did it mm but anyway, we did a lot of this kind of exploration but let’s first talk about the process when did this happen at FAIR? this was around 2022 to 2023 mm at that time diffusion models had started to take off

mm not yet, not right away this is before ChatGPT, right? mm this is before ChatGPT right, so this was around 2022 before or after Stable Diffusion? roughly the same time it was approximately the same time mm at that time Stable Diffusion was already getting attention right, that whole community

was also very active right so at the time I was very curious about diffusion models mm and we started exploring is the exploration you’re describing something you can do freely on your own without needing to report to anyone? yes, this is the freedom of FAIR right, that’s exactly the freedom I was talking about

yes so at the time in terms of the direction of research within the team, nobody was doing diffusion models at all so I was the first to start exploring this and later brought in an intern who was Bill Peebles yes, who is now head of Sora we started together right but I was the first to start at FAIR

and then brought Bill in later mm so back then I was exploring all kinds of angles and then later we kind of settled on the most important one which was the DiT direction mm and by the way let me also mention this DiT wasn’t the original goal at the very beginning right

the original goal was actually exploring the connection between discriminative and generative models mm yes, that was the original question mm right, and during this exploration we kind of discovered that this direction of DiT was more interesting mm and we focused on that ok then let’s not jump there yet

let’s continue talking about FAIR what was life like at FAIR? what was the culture like? what was special about FAIR? mm I think the most special thing about FAIR is it’s the most academic-like place inside industry that I’ve ever been to right, so a lot of the culture is actually quite similar to academia

for example everyone has a very high degree of freedom you can basically choose what you want to work on mm and at the same time you have a lot of resources the resources are beyond what you’d have in academia right so I think FAIR was a very ideal research environment for me at that stage

mm but it also has some problems, right like you said later on there were some cultural shifts right I think around 2022 or 2023 after ChatGPT appeared FAIR was going through a lot of changes mm right you’re using such a fancy-sounding term and you even have to say it in English

which shows how hard these things are to define it really is a research aesthetic right I think it encompasses everything I’ve mentioned above the specifics of how you do things I think all of that is included but it also involves some higher-level philosophical considerations like how Kaiming gave me the Diamond Sutra

I think he because the Diamond Sutra says all things are like dreams, illusions, bubbles and shadows and one passage also says: all phenomena are illusions if you see all phenomena as not phenomena, you see the Tathagata mm taking this a bit further it’s actually quite similar to certain ideas in Western philosophy quite similar actually

for example, Kant’s concept of the thing-in-itself and then Schopenhauer’s The World as Will and Representation right what they’re all trying to express I don’t know much about philosophy, I don’t want to sound pretentious but in my humble understanding I think what they’re all trying to discuss is what you see

is not the essence of the thing what you see of the world is not its true substance so when you’re reading a paper what matters is to break through the illusion the paper presents to you and question what lies behind this paper what kind of substantive essence does it actually contain I think the source of researcher taste lies in

whether people can truly set aside all these superficial appearances and then keep pursuing the path toward truth keep seeking mm I think Kaiming does this best if you think about this from a long-term perspective the question is: what is the right way to guide how you choose a topic what kind of things to work on

right this thing also connects to while you’re doing research what exactly should each step involve I think everything is consistent mm and then I think one problem with not having good research taste is people might get caught up in these appearances these appearances might be a paper’s acceptance

or the kind of fame you mentioned from the outside world or being able to get something done quickly and getting the kind of momentary praise and adulation right I think for Kaiming this is completely outside of his world model he simply doesn’t care I think right but if you ask me to list out research taste as points a,

b, c, d… that becomes pretty hard to articulate because it involves so many things because research itself, as I said is also a creative process it’s also a writing process from the writing side, by the way Kaiming is also the person with the strongest writing ability he also strongly encouraged us, saying

make sure to start writing early this, very unfortunately, even now at my age I still can’t do well like Kaiming all his papers were finished a month before the deadline at least that was the case at FAIR mm meaning while everyone else was pulling all-nighters to meet the deadline

and then feeling this huge sense of satisfaction Kaiming, you know was like a carefree free spirit having finished everything a month ago and then polishing it over and over again watching all of you rush to meet your deadlines I, in a very relaxed way have already made this thing perfect he finished everything a month in advance

everything done meaning the paper was fully written ah not just the results obtained, but the paper fully written this is already a publishable, solid piece of work so that means he had to start writing when? two months before the deadline and he only needed one month to write it? no, one month is a long time

right of course he would keep writing afterward during that month before the deadline he would polish every table every single word every punctuation mark ah for example, this habit also influenced me for instance, I now have this OCD like this kind of

how to put it, an obsession that also came from my time with Kaiming which is that in your paper not a single line should be less than 60% filled with text filled – what does that mean? meaning if you have a line and more than half of it is empty it doesn’t look good you need to fill that line or have it filled roughly

sixty to seventy percent then your paper looks more elegant elegant, or uniform oh and now with every paper I always ask all the students right, look carefully if you have some trailing word if people aren’t paying attention you’ll end up with a word sitting alone on a line somewhere

it looks terrible understood mm and also when Kaiming thinks about this, his view is this paper is not for you to read this paper is for others to read so you need to care about how others experience it mm how can you – a paper is just a vessel how do I, through this vessel of knowledge let people relatively smoothly get

to your own the core of what you want to express this communication interface needs to be pleasing to the eye that’s a great way to put it, right the communication interface must be pleasing to the eye so you can’t let your paper look too bad, right you have to get the details right so all of this you can consider it a kind of research taste

but I think this is actually something more general a kind of aesthetic toward life or toward everything in the universe mm I think these things are all connected right this is also why we care so much about our own papers being as unique as possible having our own distinctiveness we can have our own webpage design

we’ll record our own videos record videos but there are many people who wonder why you bother with all this this stuff has nothing to do with research isn’t this just a distraction? why spend extra energy polishing all this are you just doing this for hype and marketing? ah, I hope people don’t think that

because I think having your own style is actually very important mm and then this is also why all of our papers use a consistent template we have our own designs and indirectly I also hope to pass on some of my taste, again I can’t guarantee they’re all good but somehow

at least discuss it with my students we can work on this together at least together we can conceptualize think it through together right, I think this also, in my view this broader is part of research taste mm, it contains many very concrete small details an enormous number of details right but I think

this is also what makes research interesting I told you yesterday my childhood dream was actually to become a film director right mm childhood dream no, no when did that dream fade? it faded pretty quickly unfortunately but I still watch a lot of films but I think, eventually, I came to realize

the research process and filmmaking process are actually not that different why? because a film also needs to discover a theme it also involves exploration I have a story I want to tell and it shouldn’t be that I just stand at this moment and think oh this is how my story goes and then I just go straight toward the finish

it shouldn’t work that way either you should also go make the film I think you’d have great intuition right yes, exactly right the worst films are the ones that just go through the motions I start with A no conflict along the way and arrive at B and then it’s over I just play it for you

a good film actually is, or, why do we say when writing a paper that someone told the story really well, even though this might have a bit of a narrative, storytelling quality mm film is a storytelling process there’s a book I actually recommended it to students before I learned from Kaiming

I share with people some unexpected books let me recommend a book it’s called Story, by Robert McKee mm this book is a book about screenwriting mm but I think this book actually speaks to a lot of things about research and life there’s one thing this book talks about that I think is particularly interesting it talks about

what makes a good story it’s not a story that has no conflict from beginning to end a good story must be driven by conflict and through conflict to discover the true character’s core mm and in research it’s the same thing a good research paper must also set up the conflict and then through conflict

you discover the core of this problem and the solution to this problem right so I think this book has a lot of profound insights including about life mm and I think the concept of conflict in the book is actually similar to what I was just talking about that gradient mm you need enough contrast

to let you see the difference right for example if in your experiment you don’t have a good enough control group or experimental group your signal will be weak and you won’t know the answer right so having this kind of conflict this gradient is extremely important for research

mm I think this is really interesting, thank you so let me ask about another topic which is about your transition from FAIR to NYU right you transitioned from FAIR to NYU around 2023 right, to become a professor right can you talk about how this transition happened? right, so actually I spent a total of five years at FAIR

mm and for me this experience at FAIR I think it was the most formative five years of my career so I think I’m extremely grateful and this experience has really shaped who I am today mm but at the same time I always had this desire to someday run my own lab and take on students

because I think this experience the experience of someone guiding you is something I’m very thankful for and I want to pass on what I learned right so after five years at FAIR I decided to make a move and go into academia mm and so I joined NYU mm which by the way, NYU is a very interesting place

why? because NYU is somewhat unique it’s located in New York City in Manhattan mm right, so it’s surrounded by a lot of industry which gives you a lot of collaboration opportunities mm and NYU’s location in New York there is a relatively strong AI community here in New York right

for example, NYU has Yann LeCun mm who is of course a figure you don’t need to introduce mm and NYU also has Kyunghyun Cho who is also a very well-known researcher mm and then there’s also this whole community in New York like, for example Google has a large office here in New York

Microsoft also has offices here Morgan Stanley, Goldman Sachs lots of different types of companies mm so I think this is a very unique place where you can combine industry and academia mm right, so actually now when we’re talking about is Dumbo a community in New York? Dumbo is a very interesting place

in Brooklyn mm and Dumbo has become one of the more important areas of New York’s AI community mm there are a lot of AI startups here in Dumbo for example, some of the more well-known ones like Hugging Face’s office is here mm and then Runway’s office is also here mm

and then there are many other startups so New York is actually quite vibrant and the reason I chose NYU is partly because of this and also partly because of the people there mm so that’s how I ended up at NYU mm right, so then it turns out that the professor role after you actually start doing it

is somewhat different from what you imagined right? mm, I think many aspects are different for example, a professor has to deal with a lot of administrative work right things like grant applications various committee work right also things completely unrelated to research right

I was quite well protected at FAIR from a lot of this right but at a university you have to deal with all of it yourself mm so I think this is a very different experience and also advising students is very different from doing research yourself mm because advising students requires

not just doing the research but also helping students grow as researchers right and this is a very different skill set mm so I think transitioning into the professor role was actually a big challenge mm but at the same time, it’s very rewarding because you can see your students

grow right and I think this is one of the most rewarding things about being a professor mm I think that’s a beautiful thing to say so let me ask about the startup you founded right I heard that you are now a professor at NYU and also a co-founder of a startup right

what’s the story behind that? right, so the startup started a bit over a year ago right and the company is called Emu Video no, wait, that’s a product [laughter] it’s called Oasis mm so what does Oasis do? right, so Oasis is focused on AI-generated video mm

and specifically a game that is generated by AI in real time mm so the original idea was inspired by the DiT work and also by Sora mm and we thought this technology can be applied to games mm right, because games are actually an extremely good use case for this kind of technology

mm because games require very fast frame generation right and at the same time games require a lot of interactivity right so these two things together make games a very interesting application mm this thing can be applied to many many different papers no matter what your topic is

right, so I think this is also very interesting mm and then later we could maybe talk about DiT, right but this paper, this paper was again one of those that brings us to NYU? no, no, this one is also FAIR, it was the last piece of work at FAIR

oh and then at that time FAIR was already starting to have some culture shift because at that point ChatGPT had just come out OpenAI and then DeepMind were also doing very well OpenAI, as an emerging research force, mm, had actually done a lot that nobody at FAIR dared to even dream of uh

and even if they dreamed it they couldn’t do it right, so everyone started thinking what went wrong with this organizational model does there need to be a major overhaul there had already been many reorganizations this was also a trigger why I think by then it was no longer a good sign for me to keep staying at FAIR

Things were already starting to decline? Not exactly decline, just that everyone's focus was no longer on research. People would have these meetings that lasted several hours: research alignment meetings, coordination meetings, alignment meetings. And the only topic of these meetings was what exactly we should be doing. But these meetings went on for several weeks with no conclusion, because nobody knew what they wanted to do. This is completely counter to the normal bottom-up logic of research I just described; now it had become: let's all sit together and discuss what research project we should do over the next one or two years. In my view, or in Kaiming's view, or in the minds of many researchers, this looks completely anti-research. So at the time it had a big effect on us. For example, I was working on DiT then, and diffusion was also just getting started.

Nobody, not a single person at FAIR, was doing diffusion model research yet, but I thought, hey, this thing seems really interesting, I should give it a try. And Bill Peebles was an intern I recruited at the time; he's now head of Sora, and also the star of Sora's various generated videos. He's an extremely sharp person, or, in my view, what I'd call a perfect PhD student, well-rounded in all directions. But anyway, our starting point back then was not to do diffusion model research, nor to do DiT.

The first two months of exploration were entirely focused on representation learning. That is, we wanted to look at the representation a diffusion model learns, how it compares to what a normal supervised learning, or rather self-supervised learning, model learns, and what the differences are. There was actually a lot of follow-up work in this direction. But after working on it for a while, the feeling was: this thing is just so-so. A generative model can learn a decent representation, but that representation was much, much worse than the one from self-supervised learning. Completely not competitive.

So we gave up on that. But in the process, in the final month, we discovered something. The premise was that for DiT we needed to compare at the representation level against, say, ViT-based systems; that's why we hadn't used a U-Net but instead a ViT for this diffusion model. That was the starting point. And then we found: from the representation angle this doesn't seem to add much value, but our new architecture does seem more efficient, more scalable, and more stable than a U-Net. And from a code perspective,

I care a lot about these things. From the code perspective, what I call minimal description length (MDL): your code is actually quite important, it can reflect something. If your code is short and achieves the same purpose, then your method will typically be better than one that requires thousands of lines of code, an extremely complex system, even if it can do the same thing. The former, the more elegant, simpler solution, is always better. I think this is also a kind of research taste, in a sense. So we found: hey, this thing is simple, and it works, and it's scalable and efficient. So it seemed like this was the direction we should be pursuing.
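That taste for minimal code is easy to see in DiT-style models themselves: the architectural core is just "patchify the latent into tokens, run a plain transformer, unpatchify back." As a hedged illustration, not the actual DiT implementation (shapes, patch size, and function names are my own), the patchify/unpatchify step fits in a few lines of numpy:

```python
import numpy as np

def patchify(x, p):
    # x: (H, W, C) latent; split into non-overlapping p x p patches,
    # each flattened into one token of length p*p*C (as in ViT/DiT).
    H, W, C = x.shape
    assert H % p == 0 and W % p == 0
    t = x.reshape(H // p, p, W // p, p, C)
    t = t.transpose(0, 2, 1, 3, 4)          # (H/p, W/p, p, p, C)
    return t.reshape((H // p) * (W // p), p * p * C)

def unpatchify(tokens, H, W, C, p):
    # Exact inverse of patchify: tokens back to an (H, W, C) latent.
    t = tokens.reshape(H // p, W // p, p, p, C)
    t = t.transpose(0, 2, 1, 3, 4)          # (H/p, p, W/p, p, C)
    return t.reshape(H, W, C)

x = np.random.randn(32, 32, 4)              # e.g. a 32x32x4 latent
tokens = patchify(x, p=2)                   # 256 tokens of dim 16
assert tokens.shape == (256, 16)
assert np.allclose(unpatchify(tokens, 32, 32, 4, p=2), x)
```

A transformer then operates on the `(num_tokens, token_dim)` array like any ViT; the fact that the whole image-specific machinery reduces to this lossless round trip is one concrete sense in which the simpler design wins.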

So in that final month we went to work on this. And at that point we were competing for a lot of resources. People said: why are you working on this? We need to consolidate resources now, we need to do something more meaningful, a bigger project. But nobody knew what that was, so we needed these alignment meetings to discuss it. At the very least, diffusion models wouldn't be an important part of that critical path, an important member on it. So there was a lot of opposition. But I felt, I could see, that this was actually something very important, because I think, from an architecture standpoint,

I've been doing architecture work for so long, I think this is the future of diffusion architectures. It's not the whole diffusion model; as I said, the overall data, the architecture, and the objective are all very important, but on the architecture side this is an indispensable piece. So this is why, in the last month, we pushed in this direction, and in the end the results were very good: we were able to show this really great scaling behavior. We submitted the paper to CVPR and we were all very happy, and then the paper got rejected. LeCun apparently tweeted about this. Yes,

saying not enough novelty. Right: you don't have long stretches of math, you don't have a long complex structure, you came up with a very simple structure, and even though you got good results, the reviewers weren't convinced. This was another lesson. But by that point I had actually started to come around: I realized that in the huge random process of research papers, whether you get accepted or not doesn't matter at all. So we submitted to another conference, didn't change a thing, and it got accepted as an oral paper, which proves once again that this is a completely random process.

But what happened afterward was more interesting. After this paper, I realized it was better than a U-Net-based system in every dimension, so why not just use it? You've unified the underlying logic, at least on the architecture side; you can share a lot of infrastructure; it's so efficient; the results are good and it's scalable, so you can build even larger models. So we thought: once this paper is out, there will definitely be a lot of attention. Which, by the way, there was, with lots of people discussing it on Twitter. But we found, hey, nobody was actually using it for anything.

Oh, and then we started talking to people. We reached out to the Stable Diffusion folks; by the way, I think Stable Diffusion, LDM, is also one of what I'd call those twenty-something foundational papers. I talked to some people there, and then we also talked to some other big companies. We were at school at that time; this paper landed right at the end of my time at FAIR and the beginning of my time at NYU. Oh, so both affiliations were listed? Actually, no: only NYU and Berkeley were listed, because FAIR didn't let us list their name. Why? Because first, they felt this paper was just OK, it's a paper; and second, you had already left, so don't list our name. So then, after this paper, a lot of people started using DiT, and we found that Sora used DiT as the backbone, which was a huge affirmation, because at the time the Sora paper

mentioned DiT by name. Yes, so this was something we were very proud of. And later a lot of other models also started using DiT; basically all the main video generation models now use it as the backbone. So I think this was a very important paper.

Right, so then let's talk about the startup. Why start a company? I think for me the main motivation was that I wanted to see whether this technology I had been working on for so many years could have real impact. Because in academia you write papers, other people read your papers, and they may use your ideas, but you never really get to see the end-to-end impact. So I wanted to take this technology all the way to building a product. I also think games are a very interesting application, because games are one of the few places where both high visual quality and very low latency are required at the same time, and that is actually a very hard technical problem. We thought that if we can solve this problem for games, then the technology will be applicable to a much wider range of use cases. And games are also a massive market, so there's a lot of commercial potential as well. So that's the story behind starting the company.

So what has the journey been like since you started the company? I think

building a company is very different from doing research, for many reasons. One is that in a company you have to think about the product and users, which is not something you think about in research. Two, in a company you have to think about the business model and how to sustain the business, which is also not something you think about in research. And three, building a team is very different from advising students, because in a company you're hiring professionals with different skills and backgrounds, and you have to think about how to align everyone toward a common goal, which is quite different from advising PhD students. So building a company has been a very rich learning experience, and I've learned a lot from it.

Right, and the product you mentioned, Oasis, has gotten quite a lot of attention, right? Yes, Oasis got quite a lot of attention when it was first released, and the demo got a lot of views and discussion. And what's the current status of the company? We're still pretty early. We're building out the technology and the product, and we're also thinking about the go-to-market strategy. The vision is very clear, but the execution is always the hard part. So we're still working on it.

I think that's very relatable. So let me ask about your thoughts on the current AI landscape. What do you think are the most important open problems right now? I think there are many,

but one thing I think is particularly interesting is the question of how you build AI systems that can reason and plan. Current systems like LLMs are very good at pattern matching, but they struggle with systematic reasoning, so I think that's a very important open problem. Another is how you make AI systems more efficient, because current systems are very computationally expensive, and this limits their deployment; so efficiency is a very important problem. And then there's also the question of alignment: how do you make sure these systems do what you want them to do. So these are all very important open problems. And where do you see things going

in the next five years? I think the next five years will be very exciting. We'll see a lot of progress on the reasoning side, and AI systems being deployed in many more real-world applications, because the technology is getting good enough and the cost is coming down. So I think we'll see a lot more real-world impact. And what about on the video generation side specifically? I think video generation will continue to improve very rapidly, and the quality will get to the point where it's indistinguishable from real video, in the next year or two.

What it means is: a possible random event like this, a kind of black swan event, some kind of shock, a kind of, uh,

this kind of event that takes you by surprise. If, for this organization, or this person, or this matter, your gains outweigh your losses, then your organization is what's called antifragile. I think this concept is very interesting, because normally when we think about risk management, we think about how to avoid risk, but the antifragile concept says no, you should actually seek out certain kinds of risk, or rather certain kinds of volatility, because they can make you stronger. And I think this applies very well to research, because in research you're constantly facing uncertainty, and you need to be antifragile: when things don't work out, you should learn from that and become stronger. I think this is a very important mindset, and Kaiming embodies it very well: when things don't work out, he doesn't get discouraged, he just tries something different. I think that's a very important trait for a researcher.

So is there anything else you want to share before we wrap up? One thing I'd like to say to young people who want to do research or start a company: I think the most important thing is to find something you're genuinely passionate about, because research and startups are both very long journeys, and there will be a lot of hardship along the way. If you don't have genuine passion, it's very hard to keep going. I also think finding good mentors and good collaborators is extremely important, because, as I've been saying throughout, a lot of what I've learned came from the people around me, and surrounding yourself with great people is one of the most important things you can do.

That's really great advice, thank you so much, this has been a wonderful conversation. Thank you. Yeah, thank you too. Alright, so let's talk about your view on the AI landscape right now, especially in New York. What are some of the interesting things happening here? I think New York

is becoming a more and more important AI hub. There's a lot of talent here, and a lot of interesting companies. New York also has a unique advantage in that it's a very diverse city, and this diversity can lead to very interesting collaborations between AI and other industries, like finance, media, fashion, and healthcare, all of which are very well represented in New York. So I think New York is going to play an increasingly important role in the AI landscape. And what about comparing New York to Silicon Valley? I think Silicon Valley is still the center of the AI world, but New York is growing fast, and it has a different kind of energy: it's more multidisciplinary, which I think is actually very good for AI, because AI is ultimately going to touch every industry, so having this cross-disciplinary environment is very valuable.

That's really interesting. So let me ask one more question: if you were advising a young researcher who wanted to make an impact in AI, what would you tell them? First and foremost, work on problems you genuinely care about, because your passion will drive you through the hard times. Second, be willing to work hard on the fundamentals; don't skip the basics, because the fundamentals are what give you the tools to solve hard problems. And third, find good mentors and collaborate with great people. As I said, a lot of what I've learned came from the people around me, and the people you surround yourself with will have a huge impact on your own growth.

Thank you so much, this has been really insightful. I think we've covered a lot of ground today, from your early research all the way to starting a company and your thoughts on the AI landscape. So thank you so much for being here today. Thank you, it was great talking to you. Yeah, likewise. Alright, so that wraps up our conversation today. I hope you all found it as interesting as I did. Please subscribe to the channel and leave a comment if you have any thoughts. See you next time, bye.

in a really difficult position. Why? Mainly because, first, not enough resources. Let me give a simple example: when we apply for funding, the U.S. funding system,

and I might be going off on a tangent here, but the U.S. funding system has barely grown at all over the past few decades, even with high inflation. Everything has become more expensive, tuition has gone up a lot, but government grants, as well as the kind of proposal programs that companies offer, are still maintained at a very low level. On average, a body like NSF, a U.S. government agency, can give each individual PI a total of about $500,000 in funding over five years, so about $100,000 a year. And a lot of companies have actually cut back a lot,

again because of ChatGPT: the era of LLMs has arrived and everyone has gradually started to pull back. We can talk more about this later, but in any case, there are fewer and fewer sponsorship opportunities from industry, and once in a while, if there is some kind of funding opportunity, they'll typically give you maybe $100,000 to $150,000, as a one-time lump-sum grant. But you know there are probably 100 schools, 100 professors, or even more, competing for that $100,000 at the same time. What can you do with $100,000? You can fund one student's tuition for one year. What else? You can buy half an H100 node, maybe 3 to 4 GPUs. So you really can't get much done with that. And of course, this isn't just me venting: all of us so-called junior faculty in the U.S. are living in quite difficult conditions, and everyone has to find their own way to get resources.

So this is also why it's a bit like a startup: you're in a very resource-constrained situation, and you have to find resources from different places. You have to fundraise, right, Xiaojun? This is a business interview show. I said I'm not commercial at all, but actually in some ways there might still be some similarities.

And then, including people at Google: I had a collaborator at Google, and he's quite unusual, he never goes into the office. He said, hey, we could have a chat, and I said, sure, let me come chat. I flew to the Bay Area to see him, and he said we could talk, but not in an office: let's go hiking on the trail next to Google's campus. Talk while hiking. So in the middle of summer I hiked with him for an hour, and I told him about the infrastructure work we'd been doing on TPUs, these contributions, and also why building a longer-term collaborative partnership would be good for Google and good for us. So I thought: hey, isn't this just like a fundraising process? In the end it became a kind of alms-seeking, the process of seeking alms. Right, right.

Indeed, because this kind of sponsorship actually asks for nothing in return, so I'm very grateful to Google. But anyway, I think the people I should be even more grateful to are my students, who, bit by bit, overcame many, many obstacles. I have several students, like Peter Tong, Boyang Zheng, Shusheng Yang, and many others, and they all made very significant contributions on TPUs. So that's the background: we now have some GPUs to work with, and we can work on things that are a bit more

closely related to large models. This is why I started working on the Cambrian project. And of course all of these narratives, these stories, are still completely rooted in my logic from all these years, which is: first, representation is extremely important; and second, regardless of whether you're solving a standard computer vision task, or we're now in the era of multimodal large models and solving these problems through VQA, I think all of these are alike: underneath it all there's still something substantive that we need to think through. That part, about language and vision, we can talk about later. And then we later also had a paper called Cambrian-S, which goes even further: we're not just doing image-level VQA tasks, we want to also involve video, to deal with video. And the real reason I genuinely wanted

to work on this goes back to films again, and has to do with two Chinese directors I like quite a lot: director Jia, you know, Jia Zhangke, and Bi Gan, both very well-known Chinese directors. Bi Gan's Kaili Blues extensively uses long takes, and this made me think: while to him it's a visual tool, for humans this is also a very important medium for visual understanding. Because, what is a long take? Life itself is one long take. Our eyes are our camera; we are constantly doing all kinds of things in this world, and the medium of the things we see is video, it's all video. But we can see the pixels in this video and everything behind them: we can reason about causality, we can perceive space. And Jia Zhangke said something I deeply agreed with. He told me this in New York: what makes film so interesting

is that if you just look at the timeline, it's a linear timeline, but at every point on this timeline you need a space to extend its time. Like we're talking right now: even though it seems like a static frame, imagine you had a long take, or rather, that you're on the streets of New York right now, below the Dumbo bridge. What you see is still frame after frame, but what those frames represent behind them is the state of the world, the global information of the entire space. That completely transcends what a single lens encodes, each individual, isolated frame. I think this makes a lot of sense, and it's what made me think we still need to work on video going forward: even if video is hard to work with, even if it requires handling massive amounts of data, we still have to do it. So that's what we're doing with Cambrian-S, and this work is a bit like a position paper.

A position paper is, how should I translate it, a kind of opinion paper, meaning I want to put forward a viewpoint. So in that paper we discuss the concept of super sensing, the concept of hyper-perception. It's also a paper about data, about architecture, and about spatial intelligence, and Professor Fei-Fei also gave us a lot of invaluable advice. But the core idea is that we want to define a paradigm for where multimodal AI should go from here. So if you look at this problem step by step,

this may be an imperfect analogy, but you can draw a parallel with autonomous driving. You might have an L0 system, a system with nothing at all: it's basically an old language model, it can't perceive the world at all; it can't see images, it can't see videos either. But it can, through language, like Plato's cave allegory, indirectly understand the world. That's fine; we call it L0. L1 is the current multimodal system with slightly better capabilities; it's capable of what you'd call show and tell, meaning you show it something and then it can tell you some answers about what you showed it. You ask it a question and it gives you an answer; that might be L1. Then L2, I think, is what I call streaming event cognition: now this thing doesn't just look at a static image, you'd have a continuous, streamable visual stream,

and your intelligent system needs to be able to understand this visual stream, process it, answer questions, and understand what's happened. Then the next stage, which I call spatial cognition, is about what I was just saying: at every point in this temporal sequence, how do you see beyond the present moment to what's really behind it, the space behind these pixels? This is also something very deep for humans, a very unique ability. And ultimately, I think the endgame is that we need a predictive world model.
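The ladder just described can be summarized in a few lines. To be clear, only L0, L1, and L2 are explicitly numbered in the conversation; extending the numbering to spatial cognition and the predictive world model is my own assumption:

```python
# Hypothetical summary of the capability ladder described above.
# L0-L2 labels follow the speaker; the L3/L4 numbering is assumed here.
CAPABILITY_LADDER = {
    "L0": "language-only model: sees no images or video, understands the "
          "world only indirectly through text (Plato's cave)",
    "L1": "show and tell: answers questions about a single static image",
    "L2": "streaming event cognition: follows and reasons over a "
          "continuous visual stream",
    "L3": "spatial cognition: infers the space behind the pixels at "
          "every moment",
    "L4": "predictive world model: anticipates how the observed world "
          "will evolve",
}

for level, description in CAPABILITY_LADDER.items():
    print(f"{level}: {description}")
```

Each rung subsumes the one below it, which is the "staircase toward a world model" framing the paper argues for.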

Yes, some kind of predictive world model; that is what can tell you everything about the real world you observe. I think what I want to convey through this paper is that we're building a staircase, step by step, leading toward a future with a world model. Although we may not know exactly how to define this world model,

at least in this paper we won't attempt that definitional work, but we can identify which capabilities are absolutely necessary. So that's the core of the paper. We also filmed a short video, which I posted on Twitter. We didn't spend any money, it wasn't for promotion, just some students with cameras filming on the streets of New York. Unfortunately we weren't able to shoot a Bi Gan-style long take, but we filmed as we walked. It was a love letter to New York, I suppose. But a lot of people didn't understand it, saying: why are you filming this, does this have anything to do with your paper?

I said, of course it does. Our paper itself is about an intelligent agent living in the real world: how it can ingest this continuous visual-stream signal and perceive what's happening in the world. It might be moved by certain things, be surprised, feel astonished, but most of the time its brain will have some kind of spontaneously operating world model, guiding everyone to be themselves, guiding everyone to live in this world. Yes, I think this paper is actually quite interesting, because I had never done this kind of work before, kind of like wanting to set an agenda, defining the problem like this.

So I also hope to learn more from Professor Fei-Fei. She often talks about the North Star, right? So the question I've always been asking is: what exactly is the North Star of vision? What exactly is that question, and how should we solve it? That's this paper. Did you find the answer? I couldn't find the answer; if I had, I wouldn't be sitting here. I think this is an ultimate question, and I don't think it's just a computer vision problem. Or rather, what I actually want to say is that the term computer vision is itself very interesting: it's called vision, and vision has a double meaning, it's a very ambiguous word.

Vision refers to both your eyesight and your foresight about the future. When you say someone has great vision, it means they have a grand vision: visionary, vision, yes. So, computer vision. I can say I am someone who works in computer vision, but computer vision, in my definition, is a perspective. It's not a specific task, it's not even a specific field; it's a perspective, a point of view. Or rather, it is, I think, quite fundamental: a collection of problems that intelligence must solve.

It's a collection. Let me be more specific: what is vision, or what problems does vision address? I may not be able to articulate it clearly; let me think. First, the signals it handles live in continuous space: high-dimensional, noisy signals. These are the problems computer vision needs to solve. It's not about writing lots of text on paper; we need to evolve some kind of intelligence that doesn't avoid this problem. Its target domain is completely different from language: continuous, high-dimensional, noisy signals. Those are the problems vision needs to solve. Second, from the very first day of doing vision, from the first papers I mentioned, starting from DSN or HED, I already knew, or rather I had this kind of bet, that the most important thing in vision is to learn this kind of hierarchical representation.

This is extremely important: if your representation lacks hierarchy, you won't be able to solve many, many problems in this world. The hierarchical process is an abstraction process, and the process of abstraction is what's called generalization. This is also very different from a language model, because a language model operates purely in the semantic space when thinking about this problem. There are of course other characteristics. For example, I said vision is a perspective; I think there's also this kind of large-scale parallelization. We can see many, many things at once: many areas of our brain's cortex are firing, and we're processing in parallel many different objects, their causal patterns, their physical changes. These things are happening at different times and in different spaces, all simultaneously, and we have a way to capture all these changes. I think this

is also an important characteristic of vision. And finally, there may be one more, which is some kind of, I'm not sure how to define it, feature sharing. What this means is, for example, the semantic part of things, or the real-understanding part. That is to say, I see a dog drawn by a child, and a cartoon dog in an animation, and a real dog running around in the real world: how do I connect all these different visual entities together, building this kind of abstract cognition, saying, hey, they're all dogs, even though they're vastly different? From a data perspective, you know, they're so far apart that not a single pixel is comparable. So what I want to say is that vision may have even more problems to solve; I actually haven't thought carefully about this. Anyway, it'll have some common characteristics like these: hierarchical structure,

continuous-domain modeling, this kind of large-scale parallelism, and large-scale sharing. I think these things are all part of an intelligent agent; it cannot simply be reduced to a computer vision system solving a small subset of problems. So that's why I think, although fewer and fewer people are working on this direction, fewer students are applying to this area, and when undergraduates are choosing a direction they're increasingly unwilling to choose something called computer vision, and when faculty are hiring, too, we're probably increasingly less likely to hire a professor doing pure computer vision, if you consider computer vision as a perspective, I think it's the essence of intelligence. Look at the past few years:

after ChatGPT arrived. CV previously occupied a very central position in artificial intelligence; of course, that was after you entered the field. In recent years LLMs have risen and CV has been pushed back to a more marginal position. In this process, do people like you feel discouraged? I don't feel discouraged at all, not the least bit. As I said, I should be grateful for LLMs: without LLMs, vision couldn't have expanded into the truly large scope of multimodal intelligence it has now. From the perspective of vision's development history, there are actually two axes you can draw. One axis

goes back to ancient times. At the earliest stage, the things computer vision needed to handle were always the most singular, most concrete, and simplest tasks, like MNIST digit recognition: 1, 2, 3, 4, I need to determine which digit it is. Then there were some small datasets like CIFAR, a 32×32-pixel, ten-class classification problem: is it a cat or a dog, a car or an airplane? Then datasets like ImageNet appeared, and it became 256×256-level classification. At those times things were relatively controllable. Then later came detection and segmentation, a more structured kind of cognitive process, these compositions. And if this axis continues to advance, it leads to the rise of multimodal large models, because with the introduction of multimodality we can easily abandon many of these specific, relatively rigid task designs.

this kind of task design and now I can take an image and ask all kinds of questions language as a great interface can help you solve many many problems right, so you can see over this time this axis, um goes from simple to complex tasks

such an axis but also an axis where language starts gradually entering computer vision so then there are two issues here the first is that after language entered vision it brought us enormous benefits allowing us to freely define problems we can ask anything and we can get any answer mm-hmm

but the second important risk is that language’s involvement has made your dependence on language increase as well mm-hmm so many so-called multimodal cases these tasks are actually unrelated to vision purely a language problem mm-hmm from this perspective um, of course I think, yes

vision seems to have become marginalized mm-hmm, right but of course I don’t feel discouraged I see it as an enormous opportunity because in the end if the problems you’re solving now are relatively simple then it doesn’t matter problems you can solve with language just use language to solve them

right, um even though I haven’t seen it I can’t do so-called grounding meaning for the red apple you describe to me I can’t know what exactly red is or what exactly an apple is but somehow through statistical information in language I can still complete some decision-making tasks no one can fault you for this

I think that’s fine but the huge hidden opportunity is when the day truly comes that we need to deal with the real world real tasks to build some kind of real intelligence ah then this currently imperfect visual representation will be a major deficiency so Yann LeCun’s view is everyone right now is just using a crutch

that crutch being the language model itself right, and even though you can walk and you’d think hey, I’m walking pretty well you probably can’t run and you can’t participate in the Olympics right, because you have a leg the so-called leg of visual representation which is still not good enough

why do you call it real intelligence why isn’t LLM real intelligence because I think LLM is virtual intelligence but our intelligence so-called intellect isn’t that also virtual oh, I think the word virtual may not be right what I define as real is something that has to interact with the real world yes, what does that mean

meaning, look the problems that LLMs can solve well now mostly still occur in the digital space mm-hmm mm-hmm, for example um, it can memorize all this factual knowledge it can know right, we can put all these Wikipedia articles all in there and it can tell us everything we want to know

it can serve as a very good legal advisor it can even help summarize knowledge and do education do teaching a lot of these things right, and I think LLMs um, are of course revolutionary but this is different from the vision as a perspective that needs to solve problems actually they’re completely different domains

meaning if what you need to handle is continuous high-dimensional space in this kind of noisy domain then things like, for example, robots these domains aren’t just robots by the way, robots are one good example I’ll get to that in a moment ah, these things are very hard to tokenize they’ve already left this virtual space

left this digital space right, what kind of tasks does this involve you’re absolutely right I think robots are one there will also be many industrial applications, right industrial process control meaning all those involving sensory modeling signals from many different kinds of sensors

right, these kinds of sensors and they perceive what’s happening in this world and you now need a unified algorithm to model this environment this system so that you can then perform an action or intervention meaning that when you take an action or make an intervention you’re able to predict

how this system will change next this is very hard for LLMs to do mm-hmm and you’re absolutely right about that I think from my perspective, there are actually two extremes one extreme is LLMs, um very good at operating in the digital space doing many many things and also very good at using coding as an interface

right, through agents to intervene in our physical lives um, this will also happen and that’s fine but ultimately it’s still based on discrete tokens token-based these one-by-one positions ah, on the far right is Robotics and this must be truly general-purpose robotics meaning it can generalize to

generalize to a certain degree such that it can do everything a human can do mm-hmm, it has its own decision-making system and it has its own brain mm-hmm, and I feel now between these two extremes right, how to extend step by step from LLMs to Robotics I think this is what computer vision or, in the new era,

visual intelligence needs to solve right and then I think this is also the future of multimodal mm-hmm because obviously, robotics still doesn’t work now and I often tell students or people around me actually, um the thing I most want to achieve is to solve the Robotics problem without doing Robotics

why is that mm-hmm, because you think the Robotics approach can’t solve the Robotics problem not exactly it’s because I think Robotics is advancing too quickly right now at the Spring Festival Gala there’s Unitree Robotics and all that yes I find it all rather jaw-dropping

but on the other hand I think there still needs to be someone focused on the pre-training part which is what’s called the robot brain what exactly it is mm-hmm or how this brain includes your visual system right, in the control part in the hardware part this part also means brothers climbing the mountain, each making their own effort

I don’t think I need to intervene in hardware too early and do those things right I think there are fundamental research problems now that haven’t been solved at the software level haven’t been solved in building this brain we need to focus first on solving this part of course many people will argue you have to have

something like a closed loop you need some kind of collaborative approach you need to validate on your robots otherwise if you build some algorithm now some model may not be useful mm-hmm I fully agree with that but I think this can be done through some kind of partnership yes, I just don’t want to

buy this I also don’t have the money I can’t afford that many robots robots also have their own hardware scaling by the way you need to buy many robots to do hardware well mm-hmm yes, I want to focus on the brain part and I think this is a problem that computer vision needs to solve

a problem that representation learning needs to solve and also I think ultimately the problem that a world model needs to solve look at Kaiming, he started thinking about this so early wanting bigger, bigger, bigger mm-hmm why did LLM Scaling Laws come so much earlier than CV um, good question yes, I think first of all we can’t say that much earlier

because CV currently doesn’t have a Scaling Law right, and actually before, we were all pretty desperate I said, oh no how come vision still doesn’t have a Scaling Law maybe it’s alright now for example these video diffusion models have some Scaling Behavior what’s called Scaling

is that you can consume the data and get better results right or rather, the more formal characterization of your Scaling Behavior is that if you now have a Transformer system it satisfies a ratio like C = 6ND meaning your compute is basically equal to 6 times

your tokens times your number of parameters and I want to use this formal definition to make this point because I now think more and more that vision doesn’t need a Scaling Law oh, why is that because again what vision cares about is completely different from what language cares about
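
The C = 6ND rule of thumb mentioned here (compute roughly equals 6 FLOPs per parameter per token, covering the forward and backward passes) can be sketched as a quick back-of-the-envelope calculator; the parameter and token counts below are made-up illustrative values, not any specific model:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute via C = 6 * N * D:
    roughly 6 FLOPs per parameter per token (forward + backward)."""
    return 6.0 * n_params * n_tokens

# hypothetical example: a 7e9-parameter model trained on 2e12 tokens
print(f"{training_flops(7e9, 2e12):.1e}")  # 8.4e+22
```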

it’s not a radical claim but it is a viewpoint a long-held view and many people doing NLP actually agree with it that is, a language model is actually not a self-supervised learning process it’s actually a strongly supervised one it depends on how you look at it

what does supervised or unsupervised mean yes, the logic here is as follows generally speaking we say whether you have external annotations external labels determines whether you are doing self-supervised or strongly supervised learning right, but language is such a special case what is language language

is what humans over the past few thousand years of civilization through continuous evolution whether in a sociological sense or in the sense of each individual person have processed about this world and stored in a tokenized form and we happened to have something called the internet and we uploaded this knowledge

all to the internet so for all LLM researchers this is for free but something being free doesn’t mean it has no labels then one question is suppose we didn’t have the internet then if you wanted to train language models now could you still do it put books in yes or suppose you had no books

right, yes exactly, this kind of knowledge upload this thing is itself a process of supervision construction right so this is different from vision so it’s somewhat like language um, wanting to solve problems always staying in this target y space as we usually say you have a mapping from x to y

that’s all machine learning regardless of where x and y are you can define the problem this way anyway and y is usually what people call supervision the label, and x is your data right you can think of this language model as actually only characterizing things in the y space mm-hmm

mm-hmm, but this is true going back to the earlier question this is actually insufficient to represent the totality of this world there are many things that you can’t describe and characterize through language or rather this is both the advantage of language and also why language may eventually, as I said, gradually fade

or rather LLM won’t be the foundation of the entire world model that’s one reason its advantage is you don’t need to do anything to achieve some kind of alignment with humans because every sentence and every word is written by humans mm-hmm, right

when you write this down what is language language is a communication tool language is not a map of thought it’s not even a decision-making tool it’s actually a communication tool mm-hmm so if it is a communication tool you always have to make some trade-offs

you always have to sacrifice something so, ah, what I mainly want to say is yes as a communication tool it aligns well with humans but on the other hand it has also lost a lot of what it, as an intelligent system, should originally be modeling mm-hmm, right

for example, right now I have a cup of water I have a cup that fell on the ground and broke this is actually a linguistic description the reason we say it this way is because this is the most suitable form for our communication we only care about the outcome and state of things right we don’t care how a cup fell to the ground

and how exactly it broke right, which physical laws it obeyed what exactly the dynamics behind it are we don’t care about these things right so I think this is also a limitation of it mm-hmm LLM people would complain that after adding vision

it might affect their intelligence ah, why, really yes, he hopes, um like Yang Zhilin says, when adding multimodal they hope it won’t be a dumb multimodal ah, yes I agree of course you shouldn’t build a dumb multimodal but I think if you don’t add vision you’ll definitely be dumb and, but I think

the fundamental issue is how to define smart and dumb yes, it’s about intelligence the definition of intelligence is different or rather how exactly to define what is a simple task and what is a difficult task mm-hmm over the past few decades

all these AI researchers would continuously encounter this so-called Moravec’s paradox what this paradox says is the easy problem is hard and the hard problem is easy meaning things that are easy for machines

are actually hard for humans and things that are hard for machines are actually easy for humans you seem to have several works at NYU um, right I think starting with V* um, V* is actually just one piece of work I think it’s quite interesting could you talk about it because we were the first to think about

wanting to build in a multimodal system a so-called system two a model that can do scaling at test time meaning when we look at the world around us for example I want to ask you a question now right for example about something around you there’s a trash can nearby

what color is it you won’t, like a language model, directly tell me an answer you’ll definitely first think where is this trash can you might turn around and look discover there’s a refrigerator over there maybe the trash can is next to the refrigerator then you’d localize this object and find this object

right, and then tell me an answer so you have some kind of visual reasoning here and this is entirely a behavior in a reasoning process right, and we built such a system back then and this was um, for example, before o1

a very long time before yes, at least a few months earlier we had started doing this mm-hmm, right at that time this kind of test time scaling was not a buzzword at all nobody had been talking about it okay, right and I think this is worth talking about because for me it’s actually an inspiration I think it’s both

I think it’s a bittersweet kind of lesson the bitter part is, let me first tell you what happened after we had this paper we had our own benchmark and then I have two friends Alex Kirillov who’s also an author of SAM and Bowen Cheng

both of them work at OpenAI mm-hmm, so I talked with them for a long time we told them what our work had done our benchmark is here now you can try it out and I also discussed some of the logic behind it right, meaning how you can do this kind of visual thinking and later

Alex and Bowen drove this project at OpenAI this project is called think with image and later, maybe over a year later right, this product launched mm-hmm, and after this product launched many examples inside, or their benchmarks, were actually the benchmarks from our paper

oh so what makes me very happy about it is this is the first time I thought, hey we can actually find a way to truly take a different path this can somehow inspire researchers at OpenAI to improve their own models mm-hmm I think this at least makes me feel there are things to do in academia

mm-hmm but on the other hand um, it’s also rather bitter because you see, at the time of Sora the reason people were able to accept DiT was also because DiT um was cited in Sora’s blog post and Bill’s name was on it letting people find the logic

and the clues behind it mm-hmm, right but unfortunately I think, gradually in recent years industrial research labs have become increasingly closed so at first everyone published papers later people couldn’t publish papers anymore you could write some blog posts you could add acknowledgments

and also list the names of each team member and further on you could publish a blog post but there could no longer be author credits only OpenAI team or Gemini team that’s it so I think this mm-hmm will lead to, I don’t know whether the next, originally healthy kind of exchange between academia and industry

those channels will be cut off mm-hmm, right doing research is fundamentally a labor of love we explore these questions not really because they can deliver some product or earn money but on the other hand, um some kind of credit assignment meaning letting everyone know who did what

I think this is a mechanism that over the past few decades has supported academia’s ability to move forward but now this mechanism is gradually being eroded I think LLMs, this generation of models, and the organizational structures behind them gradually broke it it’s become commercial competition

it has become a form of commercial competition mm-hmm, yes right, and then let me quickly conclude I think there are two more things I want to briefly mention this paper, REPA, is called representation alignment look, there’s another keyword: representation so that’s why I really like this paper

but this paper also went through such a long time and all these past works combined in a strange way formed a kind of chemical reaction mm-hmm, and then opening up, at least a small research domain and what it does is quite simple it’s essentially a Deeply Supervised Net meaning a model you have now

doesn’t only have a diffusion loss at the top which is your final objective you also pull out some other things in the middle and you can have other objectives there the objective we used is I want to make a Diffusion Model which is a generative model by the way have its internal representation align with an external self-supervised

model’s representation mm-hmm here again, what’s being said is representation is the most important thing not only for systems like Cambrian 1 doing multimodal understanding it’s important for a generative model generating images and videos too
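
The alignment idea described here can be sketched as a toy loss: the usual denoising objective plus a term that pulls an intermediate hidden state toward a frozen self-supervised feature. This is a minimal illustration of the concept, not the paper's implementation; the function names and the weight `lam` are hypothetical, and real systems use a learned projection head onto e.g. DINO-style features:

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def repa_style_loss(diffusion_loss, hidden, target_repr, lam=0.5):
    """Total = denoising loss + lam * (1 - cosine similarity):
    the alignment term vanishes when the intermediate representation
    matches the external (frozen) self-supervised feature."""
    return diffusion_loss + lam * (1.0 - cos_sim(hidden, target_repr))

print(repa_style_loss(0.8, [1.0, 0.0], [1.0, 0.0]))  # 0.8: perfectly aligned
```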

yes, so this thing I think is quite a big inspiration for me but this hasn’t been done thoroughly yet meaning why do we need to use this kind of Deeply Supervised approach such an indirect way to do alignment ah what if we could directly use this powerful

representation as an encoder for your generative model or as its foundation mm-hmm, right and this thing took another step forward we also got very good results this paper is called Representation Autoencoder yes, it also involves representation and autoencoder but anyway in this

the logic in this thing I think again I don’t want to talk too much about this paper’s details but there’s one thing Professor Ma Yi (founding director of the Institute of Data Science at HKU) said when I visited Hong Kong that I think was absolutely right a student asked, hey you’re doing this right

your autoencoder your representation layer will now become very high-dimensional because it’s a representation now it’s not the original simple pixel-level representation nor is it a low-dimensional VAE-type representation it’s a high-dimensional representation you want to do denoising and image generation on this high-dimensional representation

this is actually a very difficult thing right, and a student asked at the time this dimension is too high it might not necessarily be a good thing it might make our learning system more complex or make training harder first of all, our results show completely the opposite conclusion and Professor Ma Yi got very excited

he stood up and said I want to sincerely tell everyone you must not be afraid of high dimensions high dimensionality is an extremely important cornerstone in all of machine learning um, whether in previous so-called kernel methods or in why a Transformer needs to have an Up Projection Layer

right, you need to have a low-dimensional vector coming in and then turning it into a 4 times larger, 4 times wider Fully Connected layer and then all these things are all telling us the following fact that in a high-dimensional space many problems that couldn’t be solved in low-dimensional space

can now be solved many types of information that didn’t exist in low-dimensional space can now exist and you’ll also have better efficiency ah this is traditional machine learning theory why, after increasing dimensions, your data points become linearly separable
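
The up-projection mentioned above is the standard Transformer feed-forward block, which conventionally expands the hidden dimension by a factor of 4 before projecting back down. A sketch of just the weight shapes (768 is an example width, and the 4x factor is a common convention, not a requirement):

```python
def ffn_shapes(d_model: int, expansion: int = 4):
    """Weight shapes of a typical Transformer MLP block:
    up-project to expansion * d_model, then project back down."""
    d_hidden = expansion * d_model
    return [(d_model, d_hidden), (d_hidden, d_model)]

print(ffn_shapes(768))  # [(768, 3072), (3072, 768)]
```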

all the same logic but I feel very encouraged that you should not be afraid of high dimensions I think these are very good words because many times people feel afraid right not just of high-dimensional representation but also afraid of escaping from some current local optimum meaning right now

many things we’ve done before were all done to jump out of this local optimum mm-hmm like VAE is the current era’s local optimum we hope to use a representation learning approach to link everything together and this thing is actually a very natural thing and then now many people are also working on related papers

there are many contemporaneous works all also very good but on the other hand this is also a not-so-natural thing because you need to break out of the existing framework to do something new yes, but when you can jump out of this local optimum and do something new I think you’ll feel like your world has opened up

because RAE for us or for my research I think is still a fairly important work because it tells me something or allows me to make a bet or predict a future what that future is or whether it’s right or wrong we can look again in a few years so this thing is also related to language and also to Diffusion Models

like the recently popular Seedance and Sora mm-hmm my current bet is there’s only one thing in this world that is important which is how to learn this representation this is important when you have a good enough representation handling other problems on top of it is simple your Language Model

will gradually degrade to a simple communication interface unlike now, where all this multimodal intelligence is driven by large language models your representation layer only provides a little bit of simple context right most of the so-called heavy lifting the dirty and heavy work is all done by large language models

mm-hmm the bet I want to make is the future won’t be like this in the future you’ll have a great foundation mm-hmm which is also a great world model mm-hmm, and then what does this world model mean we can talk more about this but this foundation itself may not be a checkpoint

it might be neural modules connected together, multiple components forming a cognitive architecture wow, that sounds quite complex but essentially it’s your brain it has different areas handling different things right the language, the LLM layer will gradually become an interface of your essential representation or rather

of the foundation of your world model mm-hmm it’s still very important it will never disappear because humans need a Large Language Model to ask questions and answer questions right to communicate with it it’s a communication interface

right also there’s another line which is Pixel Generation itself meaning how you generate an image or a video through REPA and some of our previous work we can see it also needs to be based on a good enough representational foundation ah or you can think of it

it’s a world model um again in my view in my definition representation is the most important part of a world model mm-hmm it’s not all of it it’s the most important part but when we have such a foundation you can think of it we can easily decode it into language

right and then we can easily decode it into pixels and generate videos we can also decode it into some kind of action some kind of movement so it might be some kind of analog to current VLAs mm-hmm but it’s based on a stronger representation a stronger world model architecture what parts does the current representation include

language is one of them um, I think it’s one of them and then but this is also controversial meaning like Zhilin you just mentioned he might say he doesn’t want vision to contaminate language ah they’ll still do multimodal but they want to think about how to make multimodal a smart multimodal

right without lowering the overall intelligence level of the brain yes, yes, yes about this thing I want to say again it really depends on how you define the problem but let me finish the earlier point first meaning um, you mentioned for example the position of language in this

right I think we also have our own worries meaning language is actually a poison or language is actually an opiate you add more language you’ll always feel happier oh, mm-hmm that shows it’s useful this crutch it’s useful but it’s a shortcut if you as a person

if you keep taking this opiate you’ll be ruined if it’s a crutch and you keep using it you also can’t train your leg muscles mm-hmm alright, alright these are your and Zhilin’s two perspectives yes, so I’m very worried about language contaminating vision

mm-hmm I’m extremely worried about this and moreover this contamination is already happening the way this contamination happens is as follows the entire Large Language Model has a huge value chain that transmits step by step from industry to academia this value chain means

we have a narrative at the top this narrative is whatever AGI, Scaling Law, The Bitter Lesson, LLM the logic of these narratives the current bible yes, um let me tell you about The Bitter Lesson because I absolutely don’t think the Large Language Model is a demonstration of The Bitter Lesson

mm-hmm um the Large Language Model is actually anti-Bitter Lesson ultimately what representations will be general enough what is its endpoint ah, the endpoint we can call it the world model so maybe we can discuss in my definition or in the context of this representation what exactly does world model mean

what is a world model right this is about to enter your entrepreneurship topic let’s first from multimodal to world model mm-hmm, right mm-hmm, that’s right in strict definitional terms a world model means you’re now given a system or the state of an environment um

um this environmental state might be, for example, um you can think of it as the state at the current moment but a world model doesn’t necessarily just make temporal predictions let’s not worry about that for now anyway, you first have a system or an environment with a state s_t

right and you have an intervention or action let’s call it a_t at the current moment you apply an action to this system you now hope to learn a predictive function or transition function F so that it can take your action together with your current state this environmental state to predict the next state
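
In symbols, the transition function just described is s_{t+1} = F(s_t, a_t). A toy sketch of that interface, using made-up one-dimensional dynamics purely for illustration:

```python
def F(state, action):
    """Toy world model: a point with position and velocity;
    the action a_t is an acceleration applied at step t."""
    pos, vel = state
    vel = vel + action        # the intervention changes velocity
    pos = pos + vel           # integrate one time step
    return (pos, vel)         # predicted next state s_{t+1}

s = (0.0, 0.0)
for a in [1.0, 0.0, -1.0]:    # a short action sequence
    s = F(s, a)
print(s)  # (2.0, 0.0)
```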

right, the state at the next moment so this is the most basic general kind of definition of a world model and this definition itself is actually incredibly straightforward or even somewhat trivial because this isn’t a new concept actually back in 1943 Kenneth Craik, a Scottish philosopher and psychologist

mm-hmm who first proposed this concept he said humans have in their minds such a world model this world model can tell us when we take some action what consequences will follow mm-hmm because we can predict the consequences our actions bring this can guide us in what kind of action to take

and what kind of decision to make if I know that putting my hand in a fire will hurt, then I won’t put my hand in the fire this kind of prediction structure also comes from the past including control theory in the 1960s and 70s how everyone would send a lunar probe to the moon or send it to

wherever right and then everyone actually needs to be based on such a control system for example a classic algorithm called Model Predictive Control this also involves a Model but this Model is actually also a kind of World Model this algorithm is actually very very simple meaning you now need to decide what control signal exactly I should apply

to this system to enable it to complete a predetermined task mm-hmm, right and what I need to do is at the current moment roll out through my model to continuously output the next k steps of actions an action sequence meaning I need to output my next action sequence a sequence of actions

and through this action sequence use my Model to get the state at each step and finally I’ll also have a, um some kind of cost function a metric which tells me after I execute this action sequence how far I am from my ultimate goal so this algorithm is very simple

you continuously sample action sequences find the one with the lowest cost execute its first step then jump back to the first step and repeat rolling out the next action sequence yes, so each time you need to make a decision and the source of this decision is based on your prediction of the future
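
The sampling loop just described, often called random-shooting MPC, can be sketched as follows; the toy transition model, cost function, horizon, and sample count are all illustrative assumptions, not any specific system:

```python
import random

def model(state, action):
    """Toy transition model: the action directly shifts a 1-D state."""
    return state + action

def cost(state, goal):
    """Distance from the goal after executing a candidate plan."""
    return abs(state - goal)

def mpc_first_action(state, goal, horizon=5, n_samples=300):
    """Sample random action sequences, roll each out through the model,
    score the final predicted state, and return only the first action
    of the cheapest sequence (then replan at the next step)."""
    best_seq, best_cost = None, float("inf")
    for _ in range(n_samples):
        seq = [random.uniform(-1.0, 1.0) for _ in range(horizon)]
        s = state
        for a in seq:                    # roll out the predicted trajectory
            s = model(s, a)
        c = cost(s, goal)
        if c < best_cost:
            best_seq, best_cost = seq, c
    return best_seq[0]

random.seed(0)
state, goal = 0.0, 3.0
for _ in range(10):                      # receding-horizon control loop
    state = model(state, mpc_first_action(state, goal))
print(round(state, 2))
```

Each iteration commits only to the first action and replans from the new state, which is exactly the repeated roll-out-and-select loop the speaker outlines.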

mm-hmm yes, this is the so-called Model Predictive Control how people use this World Model and then later for example in Model-Based Reinforcement Learning people also realized that a World Model is actually very important alright there’s a classic paper here called Dyna

this paper is actually by Richard S. Sutton, the father of reinforcement learning oh yes, so Richard Sutton himself wrote such a paper and he talked about a very interesting viewpoint or framing he says the human intelligence system can perhaps be divided into two types one called a reactive policy

and one possibly called a more intelligent model-based policy right this analogy is the so-called System 1 and System 2 analogy right, which is that human cognition also has so-called fast and slow thinking for very difficult problems we may need more mental cycles to study these problems

mm-hmm but for some problems for example when we drive, right when we first learned to drive we were very nervous looking left and right needing to make many decisions but when you truly learned to drive you internalize these decisions as part of your own muscle memory it becomes a reactive policy, right

so Richard Sutton in the Dyna paper said something very interesting he said, um what is Reinforcement Learning Reinforcement Learning is a very primitive, very basic model-free learning algorithm without this world model ah so Richard Sutton himself was somewhat anti-pure Reinforcement Learning

at least at that time in his paper he talks about a better system which of course is if you have a strong enough world model you can, based on the current state, predict the next state right, and then you’d have this so-called planning capability the ability to make plans

mm-hmm and then planning and reasoning are in some sense the same concept reasoning is now very hot in Large Language Models but in fact, um this kind of planning and its significance for decision making was actually discussed very early on in Control Theory and Reinforcement Learning

so I think this is the history of World Models so if we start from this angle the essence of a World Model is how to characterize a system and an environment such that you can make predictions in this system and this prediction can guide your action sequence and your own decision-making large language models predict the next word

this predicts the next action and based on this action predicts the next state right how do we understand state? state is the minimum information that can describe the full configuration of a system a source of information, you could say you can think of it that way meaning a state

means, for example this also involves another very interesting thing we need to discuss namely what exactly is the relationship between this and representation mm-hmm, right um, why do we say it’s the minimum information characterization unit it’s because suppose right now

our current physical world right let me say Earth ah, or let me not go that far let’s first talk about this room of ours right this is also an environment right so what is the state that characterizes this environment right, this state if you don’t pursue this so-called minimum information

or minimal descriptions then it can be for example, we now reconstruct this entire space entirely right and we precisely characterize all the parameters in this system including the texture of this table including our sound waves including the mass of this table this microphone’s

various physical parameters mm-hmm, alright but we won’t characterize this system that way right because much of this information is not important for our decision-making right because actually if we assume an intelligent agent now living here for the purpose of the conversation we’re having

mm-hmm then I only need to know some basic facts for example, my microphone can stay on this table and then I won’t care about every point of lighting nor will I care about every detail of the texture on the table mm-hmm, right these things are all unimportant so this state

can actually contain a lot of information or rather enough information meaning sufficient information it depends on what kind of task you need to solve so this question of how to build such a state is actually directly connected to representation learning mm-hmm
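The "state as minimum sufficient information" point can be made concrete with a toy example: the full description of the room is huge, but the task-relevant state for holding a conversation is tiny. Every field below is a made-up illustration, not anything measured:

```python
from dataclasses import dataclass

# A hypothetical "full" description of the room: far more detail
# than any single task actually needs.
full_room = {
    "table_texture": [[0.41, 0.40], [0.39, 0.42]],   # per-point texture samples
    "table_mass_kg": 12.7,
    "lighting_lux_per_point": [312, 298, 305],
    "sound_wave_samples": [0.01, -0.02, 0.015],
    "mic_on_table": True,
    "mic_recording": True,
}

@dataclass
class ConversationState:
    """Task-relevant state: the minimum the agent needs in order to
    decide whether the conversation can proceed."""
    mic_on_table: bool
    mic_recording: bool

def abstract_state(room):
    # The abstraction drops everything irrelevant to this task;
    # a different task would keep a different slice of the room.
    return ConversationState(room["mic_on_table"], room["mic_recording"])

state = abstract_state(full_room)
```

The same raw environment supports many different minimal states, one per task, which is why the conversation ties this question directly to representation learning.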

representation learning like I just said, right we need to have a hierarchical representation and the purpose of this hierarchical representation is actually to gradually build up layer by layer, iterating upward, representations that are increasingly abstract increasingly meaningful for my decision-making and increasingly valuable

mm-hmm it won’t be fine-grained to every point it doesn’t need to be fine-grained to every point so how do you abstract mm-hmm and we also can’t be fine-grained to every point it just can’t be done right because this is very obvious right for example, say we’re building an airplane

this airplane for example we want to model the dynamic system of this airplane right, I want to know how to make it more energy-efficient and fuel-efficient ah we can of course start from the lowest level we could say per cubic centimeter there might be 10 to the power of

ten-something molecules and we model every molecular collision right and then through this approach characterize our system this of course won’t work it’s a totally stupid way right, what we do instead is study this problem statistically so that’s why there’s fluid dynamics

and then there would be this Navier-Stokes equation and a series of such settings right, everything becomes increasingly abstract but the world we’re able to characterize becomes broader and broader mm-hmm actually language is in some sense abstraction language is some kind of abstraction but it’s a

proven abstraction it’s highly condensed meaning it’s an existing abstraction so what you want to build now is a new abstraction beyond language yes, it must be a latent representation mm-hmm and this thing

people can understand indirectly what kind of representation you’ve learned or which representations are meaningful all of this is fine it’s not a complete black box but it’s not constrained by the syntax and logic of language this is why I say LLMs are far from embodying The Bitter Lesson

The Bitter Lesson says you should minimize human knowledge as much as possible right put away your so-called human arrogance and hubris and these so-called relatively clever hand-designed structures minimize them as much as possible

and instead do as much as possible using search and learning to find answers right, but you can imagine if what we’re discussing now is how to characterize this world ah language is exactly such a structure language is an extremely clever product of humans mm-hmm it has intricate design it itself is

human knowledge it’s not a question of having more or less of it, all of it is right, mm-hmm so I think language has its own very strong points and it will definitely occupy a very, very important position in all these future intelligent systems but it can do CoT (chain of thought)

mm-hmm but CoT is another matter CoT is also another um, how should I put it it’s a product of this stage right oh, CoT is also a stage-specific product everything about LLMs is a fairly stage-specific product oh that’s also why LLMs I also quite agree with Yann meaning LLMs

are actually not controllable not safe either because they don’t have a true world model we even use LLMs as world models but it’s fundamentally flawed it’s a flawed world model right and um what this means is actually, meaning all current controllability or safety how does an LLM do this

it’s entirely designed through fine-tuning to achieve it you need to feed it a lot of data to let it know what should be done what shouldn’t be done or what it can’t do what can be said what can’t be said right what kind of speech might bring danger what kind of speech might be more friendly

so this is called alignment but all of this is based on some kind of post-training or some kind of fine-tuning alignment mm-hmm yes, but with a true world model actually you don’t need to do this because you can predict what consequence your action will lead to what results your behavior will bring

you can then during the inference process try to avoid such behavior mm-hmm you can add some external constraints to tell it you really can’t do this for example I have a robot holding a knife cutting vegetables right and how do I ensure now that this robot holding the knife

won’t turn backward and slash you how do you guarantee this from the perspective of a Language Model the way you can achieve this is through feeding it a lot of data mm-hmm right, but then it needs to have seen these things in its data isn’t that a world model, right a world model

doesn’t necessarily need that data because you’re able to foresee this outcome meaning I’m able to take an action I can understand if this knife turns around now and creates a certain danger, what the result would be how do you let it know um, that’s part of your training of the world model

it seems the definition hasn’t converged yet for example, the world model you define and the world model Li Fei-Fei’s team defines what is the difference ah, right so what I just elaborated on is all within our definition of the world model but I think the problem we’re encountering now is that this world model is hard to define

the reason is actually that it’s not a technical approach it’s not an algorithm it’s a goal mm-hmm meaning all of us whether you’re working on LLMs or Video Diffusion Models or Gaussian Splatting all of us are on the path toward the world model so I say

sometimes these competitions or these arguments I think before long maybe in 1 to 2 years will all seem extremely ridiculous because we’re actually all developing along this path and everyone knows this should be the right path it’s just that

everyone is thinking about this problem from different directions for example in our definition or let me first talk about other people’s definitions for example for a Video Diffusion Model company like Sora like Bytedance’s models like Genie (developed by Google DeepMind) right, and then

all these models including Runway Luma every company making generative models is doing this all positioning themselves as World Model companies but they’re actually still mainly focused on building a world simulator the so-called world simulator mm-hmm their goal is still

to render visually compelling videos with some kind of consistency able to have sufficiently long content and so on, and you can apply controls to it mm-hmm, you can choose like Genie right take two steps forward take two steps backward you need to ensure you have some memory or whatever this thing

is the kind of problem their world simulator or this generative world simulator wants to solve and um Professor Fei-Fei’s side at World Labs I think it’s more like a frontend an interface for assets this is also very important because it’s a strong 3D representation so

By the way also congratulations didn’t they just successfully raise funding if you look at their lead investors the people they’re working with for example I saw in the news Autodesk invested $200 million in them mm-hmm so what kind of company is Autodesk Autodesk is a company doing 3D modeling, visualization and CAD

or whatever design kind of company right so in this scenario you need a very, very concrete 3D form you can also call it a representation it’s also some kind of representation but it means this thing is not an abstract concept right, it’s not hidden in your parameters it needs to have an explicit 3D

form there that way you can then master some kind of spatial intelligence in this space you can explore in this space and you can be one hundred percent certain you won’t make mistakes for a World Simulator a Generative World Simulator this is not necessarily so right, although you can through longer context

have better memory but it cannot be guaranteed mm-hmm and what we want to do is actually more like building a predictive brain yes, meaning the core of how we view this problem is still about how to enhance intelligence itself yes, so that means you think LLMs are not intelligent enough

I think, again LLM is a crucial part of this intelligence system it’s a module but it’s not everything it’s not everything right let me give another example for example, why when LLMs do world modeling it’s fundamentally flawed for example let’s go back to this vision question

right, we’re now sitting here mm-hmm if we turn our head slightly say 5 or 10 degrees that generates hundreds of frames actually this frequency is very, very high human perception can actually handle frame rates of, say, 100 Hz which is extremely impressive right if you process this problem the way an LLM does

what would happen mm-hmm at least processing it the current way what would happen is I would need to tokenize every frame we flatten it stringing it into a very very long sequence every frame I can do some downsampling or whatever, doesn’t matter and then we string them together right, say I have 256 tokens per frame

now you might have 32 frames or 128 frames stringing them together then you’d have 256 times 128 tokens then you put them into a Large Language Model and align it with language and finally answer a question but does this make sense it makes no sense at all mm-hmm because you’re actually taking this kind of world

representation mm-hmm behind it there’s actually some kind of global state right you serialize it into a very very redundant token mm-hmm and Transformer people say it doesn’t have much inductive bias it actually still has some inductive bias its inductive bias is

it has to pay equal attention to every single token oh well, that itself is unreasonable right what this shows is that the modeling approach of language models cannot handle the cognition of these continuous spatial signals it just doesn’t hold up so this is why for us, when it comes to the world model we’re building,

I think it needs to have the following characteristics right, it needs to um, be able to understand the physical world and the definition here is that it must be the physical world although world model applications will also extend to things like digital agents a gaming agent for example will of course also benefit from the World Model

but I think its primary task is to solve the problem of physical world understanding and it needs to have sufficiently large associative memory Memory is also a very very important component of a World Model-based system as a whole mm-hmm and it needs to be able to reason able to plan mm-hmm

we just talked about planning able to do this kind of counterfactual reasoning or this kind of causal inference also very very important and the last point is that it needs to be sufficiently controllable and safe it needs to be a safe system right, I think all these things I’m actually borrowing from Yann on this

these talking points but I think these points are actually very very insightful right, not too many, not too few mm-hmm it and large language models are not in a derivative relationship they’re in a replacement relationship uh I think it’s not exactly a replacement relationship either

uh why did I just say that everyone in the field is moving toward world models? the reason is large language models also want to evolve toward world models actually that’s not quite what I mean what I mean is before large language models existed we couldn’t really talk about world models at all if you have a purely RL-based system

you’re purely doing overfitting to the current environment Large Language Models gave you a certain degree of cognitive ability about the real world it forms one element mm-hmm, it forms one element but this thing as I said, is fundamentally flawed because its cognition is too indirect yeah

what language can give you is really just too little mm-hmm, right and language has other problems too namely it is a fundamentally a communication tool so when we use language unless you’re saying something like in a dream state like talking in your sleep most of the time you use language with an intention

you want to convey a purpose so LLMs are more like in my view, more like an extension of a search engine right? or a chatbot is more like an extension of a search engine we always bring the purpose in our mind to ask a question and expect an answer right? but this is not what a World Model is

in essence as I just said the World Model in our brain is doing a lot of work in the background there’s even a lot of psychology some counterintuitive findings that say your brain has already made the decision for you before you decide to say there are three buttons on my desk before I know which button I want to press

I can already detect that my brain has already made that decision for me this experiment is called something like the Libet experiment or something it’s a controversial experiment but what it demonstrates is many things are happening in your background already happening in your brain this is part of your world model

a Language Model is not like that language is just a communication tool you always come with a purpose throw out a question and want to get an answer it’s also a reasoning tool right it’s also a reasoning tool of course, but only a symbolic-level reasoning tool so you want to build a world model like the human brain

I think we need to look more and more at people mm-hmm, actually not just people all kinds of animals how their intelligence actually arises mm-hmm, right let me, let me first conclude what I just said which is why is everyone step by step converging on this World Model? the reason is language models

have already shown a bit of World Model-like behavior even though it has no actions it has no real understanding of the physical world and it can’t truly reason and plan because its planning through CoT and its reasoning through CoT is still very different from what I just described like MPC-level planning

CoT also brings its own set of problems but all that’s fine but the next step you’ll see for example everyone’s doing whether DiT or whatever model but people started doing generative models and that has made things somewhat different right? mm-hmm, and that’s why many people

who do video generation call it a world model I think that’s understandable although I don’t agree that the video generation model they’re doing is the final end game world model but it has indeed pushed one step beyond language models right how does it do that? on top of language models uh

I think all these systems now actually still rely on language models right? they still use language models to do prompt rewriting and then to help serve as a conditioning fed into the video generation model and language models have actually become you know the historical progression here is quite interesting

language models used to be the main thing now language models have become a preparatory step for video generation models a scaffolding in the old language models what you modeled was P(y) right? and that y is still in some semantic space information in some kind of label space mm-hmm, but now with video generation models

what you model is the probability P(x|y) what this means is what you’re modeling now is already x x is the data itself your y has become a condition — this is already very different okay why is it so different? it’s because when you have a low dimensional y space and then you go to model such a distribution

your probability density only competes within your y’s distribution meaning the likelihood you assign I’m getting a bit too technical here but anyway or let’s not talk about language models first let’s first talk about say a model that classifies 1000 categories you can think of

these few labels as a precursor to language it’s also a low-dimensional vocabulary right? and then if you’re doing a classification problem like this all the decisions you need to make are if this thing is a cat it can’t be a dog right? this thing is constrained by my label set mm-hmm

but when you start modeling P(x|y) when you’re doing a generative model the likelihood you assign in this case says what phenomena actually exist in the world which things are more likely to exist that becomes very very different right? because the amount of information you need to learn now is far greater than what you get from modeling P(y)

you need to understand why in this world a four-legged cat is more common than a three-legged cat right? why if I’m generating a video say I have, I don’t know a running video why would I have a smooth running state rather than suddenly hallucinating three legs four legs which is more believable

more probable, right? in probability space more probable this already carries enormous amounts of information what you need to model far exceeds what you need to capture in language space or in label space right? you already need some understanding of the world so this is already more in line with the Bitter Lesson in my view
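A back-of-the-envelope calculation makes this gap concrete: naming one of 1000 classes carries about 10 bits per decision, while assigning likelihood over raw images means covering over a hundred thousand times more raw bits per sample. The round figures below (1000 classes, a 256x256 RGB image at 8 bits per channel) are illustrative choices, not numbers from the conversation:

```python
import math

# Output space of a 1000-class classifier: ~10 bits per decision.
label_bits = math.log2(1000)

# Raw pixel space a generative model must assign likelihood over:
# one 256x256 RGB image at 8 bits per channel.
image_bits = 256 * 256 * 3 * 8

print(round(label_bits, 2))  # 9.97
print(image_bits)            # 1572864
```

Raw bits overstate the true information content (pixels are highly redundant), but the orders of magnitude show why modeling P(x|y) demands so much more understanding of the world than modeling P(y).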

meaning you’ve abandoned more of the cognition in language space and its logic and its syntactic structure and started modeling pixels started modeling the pixels themselves but taking it one step further pixels themselves might also be wrong pixels themselves are also not Bitter Lesson enough

mm-hmm what are pixels pixels are a human-defined regular grid just a grid of little boxes each little box might have 8 bits of information and you might have this kind of lattice like a cell by cell by cell arrangement this is a pixel this is each frame of the image we see right?

this is also an interface mm-hmm this is also made for humans to see right? that’s why world simulators why do people think Genie is so cool because we create a video we create a game this is for humans to see but taking it one step further the real Bitter Lesson says

I don’t need to make it for humans to see why do I need to make it for humans? right? who is it for? it’s for your system to see it’s for your world to see mm-hmm it depends on what you ultimately want it can be for humans to see but being for humans to see is not the core of a World Model

it’s the interface of the World Model the World Model itself is spontaneously learning better representations making better predictions right? but this thing itself whether or not you want to generate a cool video is actually irrelevant and whether or not you can answer some questions about your input space

is also actually irrelevant so again let me repeat what I was just trying to say each of us is moving forward on the road toward world models the world model is a goal not a specific path uh, not a specific algorithm or a specific technical roadmap and someday we will have a better world model

mm-hmm language models will, on top of that also get stronger we’ll have better multimodal models that can better understand the world and we’ll have better video generation models mm-hmm and I think RAE is an early prototype in this process mm-hmm, yeah so now there’s also a very hot concept

the so-called Unified Model or Omni Model where people try to stack all the data together so that we can have one system that can do both understanding and generation what people also discuss is does understanding help generation or does generation help understanding mm-hmm I think neither really matters

understanding and generation are one both need a real World Model as their foundation right once you have that good World Model that can do some kind of prediction can do some kind of planning and reasoning the upper-layer decoding is actually very very simple so you think they’re all built on top of the world model

which is the base layer right you can think of it as what we want to do or what the representation school wants to do is the very bottom layer of the cake this base the representation school asks how to unify representations meaning unifying them with language ultimately unified into some kind of representation

abstracted into a few abstract representations so you still need scaling, right? you still need to besides language, what other scaling can we currently see? language scaling we just touched on this language scaling itself I think is again something a bit hard to articulate clearly because we also know

there’s a theory which says compression is intelligence right? compression equals intelligence compression equals intelligence yes, but what it’s saying is your language model is actually a lossless compression process or rather, language models getting bigger improving results is not because it’s memorizing by rote

having memorized all of this content it’s simply a stronger model so it can have a better compression ratio to compress all of your input information it brings some kind of generalization ability I think I agree with this view but I want to step back a bit I want to say actually because of the nature of the problems language models care about

its Scaling Laws actually contain some padding what I mean by padding is that it isn’t the smallest model that answers questions by truly understanding the world it doesn’t need that and all our benchmarks and the tasks humans use Large Language Models to achieve

also require it to be able to retrieve right, to be able to retrieve factual knowledge if a model can’t tell me say about a specific person on Wikipedia what they did in the past that’s a very poor Large Language Model so what I want to say is the Scaling Law of language models

is based on a representation of knowledge that’s the Scaling Law derived from that so that’s why it may have a relatively balanced ratio meaning your number of tokens your data and your parameters need to be roughly 1:1 that’s how that approach works right? then scaled-up world models, especially visual intelligence-based

world models I think will have a very very different Scaling Law it will have a Scaling Law but the slope of that Scaling Law may be completely different or its ratio may be completely different my current intuition is the model won’t be that large the model doesn’t need many training parameters because you don’t need to remember

if you want to do video generation that’s a different story but you don’t need to remember everything all the subtle details in the world that you can see you don’t need to solve some definite equation in some very high-dimensional space to determine whether an apple falls mm-hmm it doesn’t need to do these things

it doesn’t need the highest level of human intelligence we can discuss separately what human intelligence actually is but anyway it doesn’t need these things it doesn’t need to memorize all this knowledge it needs good understanding capability for processing and filtering out information

and then because ultimately what really matters is the decision itself mm-hmm right so this will become more and more like humans because that’s how humans are there are some very important facts about humans right? like the human visual system or rather all human sensors combined

including hearing, vision, smell touch, all of these this is actually extremely high bandwidth this bandwidth might reach say 1 billion bits per second in the range of 100 million to 1 billion mm-hmm but when we’re talking right now the bandwidth is extremely low the bandwidth is only ten to

ten to one hundred bits per second mm-hmm so what’s actually happening? right? what kind of model is our brain that at twenty watts of power takes in one billion bits per second of information through our eyes and all kinds of sensory inputs and converts it into 10 bits per second of behavioral output

this is the World Model itself it filters out large amounts of useless information and noise right, there’s a lot of redundancy it knows what’s important and what’s not important so the filtering system is very important right, of course this is also a hierarchical filtering system mm-hmm mm-hmm, that’s indeed the case
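Taking the figures quoted above at face value (roughly 1e9 bits/s of sensory input, roughly 10 bits/s of behavioral output, both order-of-magnitude estimates from the conversation rather than measurements), the implied reduction ratio is easy to compute:

```python
# Rough sensory-input vs behavioral-output bandwidth, using the
# order-of-magnitude figures from the discussion above.
input_bps = 1_000_000_000   # ~1e9 bits/s across all senses
output_bps = 10             # ~10 bits/s of behavioral output
ratio = input_bps // output_bps
print(f"reduction ratio: {ratio:.0e}")  # reduction ratio: 1e+08
```

An eight-orders-of-magnitude reduction, on about twenty watts, is the hierarchical filtering the conversation attributes to the brain's world model.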

so how do you train this world model? uh, language models are easy to train because internet information is just sitting there so you can train it but with world models, it seems like I don’t even know where to begin right, I think this is the biggest bet because the closer you get to the essence of intelligence things become

much harder mm-hmm, right I think like you said we went through the period of dumping the entire internet to train models that era I think going forward uh I honestly don’t know if this path will work I have enough confidence but if you asked me whether it’s 100% guaranteed to succeed

not necessarily the reason still comes down to data can we actually pull this off to the fullest extent how much data does it need? what kind of data? I think the past era was about dumping or downloading, I should say, the Internet now the era is about downloading humanity mm-hmm

we need to download humanity mm-hmm so right now, again right, everyone processes this knowledge we have something called the Internet we can upload it we can train a Transformer everything is good but for truly understanding the world a 4-year-old child the videos they’ve seen — Yann often cites this example

already exceed all the tokens used to train all of these large language models right? a four-month-old baby the amount of video they’ve seen exceeds all 30 trillion tokens of the best large language models’ data right? so this magnitude is just enormous so when I said we need to download humanity

the data that human eyes see how do we actually collect that data? right? I think video is still that’s why before I was very eager to do more video-related research I think this is the best hope we have right now right, mm-hmm oh this might have a very high barrier but I don’t think it’s necessarily impossible

I think we can proceed in several stages first we can start with internet data start with YouTube mm-hmm as I was saying no matter what all of these training tokens tens of trillions of tokens the amount of information a four-month-old baby has seen all that data equals just 30 minutes of uploads to YouTube

there’s a massive amount of data on YouTube mm-hmm is there a copyright issue with that? uh everyone knows there are copyright issues and everyone everyone is continuing continuing to do it anyway mm-hmm, yeah I think at some point there will definitely be major copyright issues or rather this isn’t just a copyright issue

because YouTube may not own the copyright to these videos but it’s a terms of service issue YouTube prohibits you from scraping this data which makes this data extremely hard to collect basically impossible to get you download a few videos and YouTube blocks your IP and then you have to switch to a new IP right, so it’s kind of

now I think uh these data companies and these platforms are in this cat-and-mouse dynamic mm-hmm one side is tightly guarding against data collection blocking you from scraping the other side is trying every means to get more data mm-hmm, right I don’t know how it will end

right wow, ByteDance has such a huge advantage ByteDance has such a huge advantage and ByteDance doesn’t care right? but they’ve received a lot of cease-and-desist letters too so I don’t know I think going forward there may be more right, but I think this gets into human society’s more political optimization

mm-hmm, alright step one is video step one is video and then next running in parallel is I think this kind of world model or this very vision-centric world model will have some very promising application prospects because I think doing only research isn’t enough the reason LLM succeeded

is also because the chatbot interface was so successful so natural it relies on the internet on mobile devices but it’s a very good interface a very very good product so even OpenAI’s own people didn’t realize it right, but when we talk about world models especially

the world model we just defined what is the ultimate product exactly? I think this might be the real hard problem mm-hmm maybe an even harder problem than data so right now if I just brainstorm ideas off the top of my head the ideas might all be wrong in the end but there are at least two outlets

one is something like AI glasses this kind of truly personal assistant this needs a World Model with only a language model that’s not enough with only a language model it’s still just ChatGPT but with a screen and voice interaction right? it can’t break out of that product form for example I often give people this example

I’m now wearing some wearable devices they’re not real AI wearable devices right? but somehow they possess some traits I think are world model-like mm-hmm the reason is they’re an always-on device it’s always on always monitoring your body signs right? and there’s a large amount of information

because every second right, I’m not sure at what frequency it collects this information but my heart is always beating so it can always track this information and then where does this information go? right? this information itself is meaningless to me knowing my heart rate BPM at a certain moment

has no meaning to me at all so it needs intelligent decision-making to tell me you seem to be under too much stress right, you’re under too much pressure now you need to slow down and then it says your sleep hasn’t been very good the past few days you might need to consider some remedial measures or maybe you should take a day off today

right? I think this is actually quite world model-like except this is the most basic world model possible because the information it can get is just too little mm-hmm it’s very narrow information right, very very narrow mm-hmm, right? but I think this is a glimpse of a future world model
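The pattern he is describing, a continuous narrow stream of readings turned into occasional intelligent decisions, can be sketched in a few lines. This is a toy illustration only: the function name, window size, and threshold are all mine, not from any real device.

```python
from collections import deque

def stress_alerts(bpm_stream, window=5, threshold=1.25):
    """Flag timestamps where heart rate spikes above a rolling baseline.

    bpm_stream: iterable of (timestamp, bpm) samples from an always-on sensor.
    A sample triggers an alert when it exceeds `threshold` times the mean
    of the previous `window` samples.
    """
    history = deque(maxlen=window)  # rolling baseline of recent readings
    alerts = []
    for ts, bpm in bpm_stream:
        if len(history) == window and bpm > threshold * (sum(history) / window):
            alerts.append(ts)  # sparse decision out of a dense stream
        history.append(bpm)
    return alerts
```

A real world model would fuse far richer signals than a single BPM stream, but the shape is the one he points at: dense always-on input, sparse high-level output.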

in AI wearables mm-hmm because if we imagine there were actually glasses or, right I know you don’t like wearing glasses but suppose there were some kind of wearable device that could truly be always on we don’t know how to solve the power consumption issue never mind the hardware issues let’s set that aside

but it could see in real time everything we can see right? with completely always-on and infinite tokens flowing into the system mm-hmm I think this actually has enormous potential and first of all I’d really want this thing because I want to know at what time I drank a coffee

and whether I drank that coffee an hour too early or an hour too late causing my sleep that night to not be as good or say I’m an athlete who wants guidance on every movement or say I work in a hospital and I want to equip every elderly person in the nursing home with such a wearable so I know what their daily behavioral patterns are

what medications they’ve taken what they’ve been doing ah how they’re feeling emotionally right, what their condition is mm-hmm, yeah and link it to their medical records in the background and provide better intelligent decision-making I think there are many many similar examples right, but this is based on current LLMs

existing multimodal intelligence which I think actually can’t do this mm-hmm and then another outlet we also just touched on this I think it’s Robotics I think Robotics faces the problem of the brain not being good enough mm-hmm and even if it can do martial arts

it can perform of course you can’t deny that’s also a good vertical domain right, the entertainment market might also be quite big so let robots go perform then I think that’s fine too but this is far from a general-purpose robot that can enter every home carry elderly people up and down stairs

take care of their daily needs this is still extremely far away mm-hmm, robots that can actually work are still a wasteland [laughs] yes, yes oh and I think this part you can see robotics is actually a very good downstream application because no matter what new upstream we talk about in the broad world model sense

like these glasses ah robots can benefit from it mm-hmm for example LLM came out and we had VLA, right? that was hot for a while now video diffusion is doing well action-conditioned video diffusion is doing well right? this generative approach this world simulator doing well

so we’re also discussing how robots can use these models to do better action planning right, there’s a lot of work like that so as I said I think there’s still a long way to go here but I think watching robots online watching robots on the Spring Festival Gala

versus in private talking to researchers in the robotics industry the feelings are very different how so? the latter the latter are willing to tell me the truth oh that doesn’t mean they’re normally being dishonest just that the latter are more willing to tell me exactly where the shortcomings of current systems lie

why something sounds like it should work but existing models just can’t solve it so we just talked about your decade-plus-long research journey how did you make the jump to world models? mm-hmm I think there wasn’t really a jump as I’ve been saying throughout I think what I call representation learning

representation learning world models and the entire development of AI is actually a fairly smooth transition and I’m actually not a big fan of the term world model as a label I think it sounds a bit hyped and now it’s become a kind of catch-all term for everything and everyone is claiming they’re doing world models

I think on one hand it’s true that this isn’t exactly a process a researcher would enjoy but on the other hand I think a field moving forward may still need some of these buzzwords and if I had to name something I might appreciate one thing

about the world model about the so-called World Model and that is this comes from Jitendra Malik, a professor at Berkeley he said the one thing he likes about World Model is that it lets him tell people I’m doing a World Model not a Word Model word as in W-O-R-D, right — I’m doing a world model

not a word model and a word model is an LLM I quite agree with that so as I keep repeating, I think world models are a destination that everyone will eventually reach it’s a goal right mm-hmm, actually as you started pursuing world models

you also made a very major decision which is to start a company — this is a very big very different choice from your previous research career a different choice why did you make this choice and how did it come about? oh this decision was also something of a metaphysical one metaphysical oh well

this people might think I’m being too mystical about this but it really was because before, I had many friends in the Bay Area some mentors who’ve been very helpful to me and some of them may be investors in that capacity or other entrepreneurs and they said Saining, you should also try starting a company

mm-hmm because at the university as I was saying earlier resources are scarce right, but that doesn’t mean university is worthless I think university is actually a very good platform it gives me enough space to truly find what I want to do but I suddenly felt that now seems like a moment

where what I want to explore has been explored to a certain extent and going further might fall into what I call the medium paper trap [laughs] like the middle-income trap meaning you’d publish decent papers but because of resource constraints you can’t truly turn your ideas into what might be a new breakthrough in some sense

right, so I thought this might be a good moment and then so I had a manager who asked me it was at quite an interesting moment probably around year-end of ’25 or maybe it was in the fall mm-hmm, right year-end of ’25 and he said go ask Yann LeCun

he doesn’t seem to be very happy at Meta lately but at that time it wasn’t actually that turbulent yet Alexandr Wang hadn’t come yet (Scale AI founder, joined Meta as Chief AI Officer) and like the layoffs at FAIR and all that turbulence my first instinct was oh, how could that be? right, Yann, right? we can later

talk more about what kind of person Yann is but at least at that time I would have thought he’s still the godfather of AI, right? and he is a pure researcher how could he be pulled into a startup? and then we had this conversation the Monday two weeks after that we happened to have a one-on-one meeting

a one-on-one meeting with Yann LeCun yeah and before I could say anything Yann said to me, hey Saining, don’t tell anyone yet but I’ve already decided this what I want to do now should be done outside I want to start and build a company and then I asked him what do you want to do?

what’s the business model behind this? mm-hmm and then I realized wow this is completely aligned with what I’d imagined mm-hmm, very interesting right, and what is this thing? I think you can call it world models and the logic behind it is I think the thing I want to do

currently can’t be done anywhere in the world I don’t think it can be done including in the Bay Area can’t be done in Silicon Valley either so what is this thing? that is to say you still need a certain degree of research depth right? it’s not completely saying, hey we now have a Large Language Model we want to deploy this system

and push to product and then go get some revenue it’s actually not like that right? and I think this has a strong research-oriented inclination mm-hmm, right? but it’s also not in a purely academic setting it’s not the old FAIR and it’s not NYU either

it’s not a university and it’s not the old traditional FAIR either but on the other hand it’s also not the Bay Area’s big tech companies and the many neo labs now operating in a completely closed manner what does closed mean? closed means you don’t open source you can’t publish papers

and like the blog I mentioned mm-hmm you can’t put your name on it can’t put your uh name on it and like when I was actually at Google at GTM I was in GenAI and I was the only one there who had, in a sense, a foot in both worlds a double affiliation still doing things at the university

people there actually have some resistance to academia to this kind of purely exploratory research that’s the Bay Area’s current state right resistance how do you understand that? who’s resisting? resistance means first, I think people look down on the work academia is doing

they don’t think academia’s work can truly ah generate any kind of impact second because they also don’t publish a lot of things you don’t know what they’re doing right? even within these big companies actually some large companies have research departments and more product-oriented departments

but even between these two departments in the same company there’s still a big divide because again, the side doing say core model training at these companies, these departments need to be in this highly competitive race mm-hmm at the very front that’s their only goal it’s an arms race

it’s an arms race mm-hmm and this squeezes out your research space mm-hmm it sucks away the oxygen in that environment the oxygen that gives you sufficient freedom to do research mm-hmm, so you never considered joining any lab you couldn’t stand that suffocating feeling yes

I think this is also a very interesting phenomenon the phenomenon being there were indeed some opportunities back then and I was considering other options too but after thinking about it I felt that if you really want to do truly cutting-edge exploration if you want to define the problems you probably have to do it at your own startup

for that to work mm-hmm, someone else’s startup means they define the problems and you come to execute that’s other startups well first of all I don’t think among all these other startups there’s any single startup or any big company that’s focused on what we’re doing what is called building the predictive brain

right? working at what you might call the most foundational layer or the most upstream layer doing things there that simply doesn’t exist even more interesting is actually many of my friends when I talk with them everyone realizes this is actually necessary as I just said this thing

on one hand is somewhat of a counter-consensus view right, a contrarian view but on the other hand over the past year it has gradually become a consensus so what I’m saying isn’t all that new nothing particularly new mm-hmm but I briefly mentioned I think in the entire AI industry right now

there’s this enormous AI this kind of value chain at the very top of this value chain as I just said there’s Bitter Lesson there’s a narrative of AGI and LLM this has defined a series of benchmarks mm-hmm right, so you compete on leaderboards mm-hmm, mm-hmm and you just compete

the leaderboard might be LLM Arena or other leaderboards right, there are a series of benchmarks these benchmarks define resource allocation meaning how you allocate resources mm-hmm right, because my goal if it’s to be number one on the leaderboard then I can only pour in the most resources

to be able to compete at that level and then resource allocation actually means this has already drifted somewhat from what researchers think is right or wrong although some very strong researchers know we may need to do some research but under this value chain resource allocation means they can’t do this part of the research

so for example I think hmm video understanding is actually quite important but now it seems neither academia nor industry is doing much of it or people are doing it but not approaching the problem from a fundamental World Model angle but why is that? this is a very interesting phenomenon

you’ll see it’s not that no one is willing to do it it’s not that no one has the ability to do it mm-hmm it’s that all of them, without exception regardless of which company without exception have been assigned to a video generation model team mm-hmm because this is the only

position within this value chain from which they can even indirectly participate even though they all know we haven’t solved this problem we need a better as I just said a World Model based video understanding model and this might be an important prerequisite

for actually training that World Model but people won’t have space to do such exploration mm-hmm so back when I was at Google I had that frustration too including when we did the RAE paper that paper, with my student Boyang Zheng, took us probably almost a year because this student in between might also have

had some health issues anyway there might have been some gaps in there right? anyway, to finish this work it took us a year mm-hmm when we published this work I was actually a bit worried I thought hmm would there be some Google researcher coming to me saying why did you publish a paper

we’re doing the same thing you’ve exposed our secrets mm-hmm turns out yes oh several researchers came to me and their feedback was I think this is right I worked on this for two weeks but my manager said you can’t do this anymore we have product cycle one coming up

product cycle two product cycle three, right? these product launch timelines need to be completed their motivation is different their motivation is different so it all comes back to I think we need to return to what we discussed at the beginning in this kind of finite game in this highly competitive environment

every company seems to have lost its ability to define problems for example you see that before, like OpenAI, right? they actually had that ability mm-hmm many of these problems were defined by them right? including GPT including models like CLIP right? or say from their very first day

as a research unit they had this kind of problem-defining capability mm-hmm right? but now it seems like even OpenAI to some extent is being swept into this race mm-hmm, of course they were once the ones who defined the race now they’re the ones being competed against mm-hmm so I think the AI industry right now

needs new problem-definers and Yann has this conviction that the current path mm-hmm cannot lead to true intelligence right? so someone needs to define new problems on this larger scale I think Yann and I share a lot of common ground on this matter mm-hmm, so you found a kindred spirit yeah, that’s a better way to put it

mm-hmm so then you started the company right? then you mentioned Yann let me ask you what kind of person is Yann? what’s it like working with Yann? mm-hmm Yann is a very unique person mm-hmm I’ll start with a few of his characteristics mm-hmm

he’s very principled mm-hmm and I think his principles are very rooted in his deep understanding of the problem itself mm-hmm which is why he when he says something is right I think he truly believes in what he says mm-hmm and won’t be swayed by other people’s opinions mm-hmm and I think this quality

in the current research environment is actually very rare mm-hmm because most people well first of all researchers are human beings mm-hmm they also need to consider their career their citations right, their impact factor mm-hmm and follow the trend when everyone else is doing LLMs

I should also publish some papers on LLMs mm-hmm but Yann clearly hasn’t done this mm-hmm right? and for me I feel like I also belong to this type of person mm-hmm second I think Yann is from my observations a very good leader mm-hmm right, how so?

how so? Yann’s leadership style is he actually doesn’t manage people much mm-hmm mm-hmm and Yann’s approach to leading is through his vision mm-hmm and through what he stands for and all the values that he represents mm-hmm to attract people to join him

mm-hmm and then he’ll also give you a lot of freedom mm-hmm he’s very empowering mm-hmm, that’s great right? and I think this is a style that works best for me because I also don’t want to be managed very much mm-hmm mm-hmm, so you two get along really well mm-hmm

yeah, I think we complement each other mm-hmm because I think Yann is more of a visionary mm-hmm and I’m more sort of more grounded someone who can actually execute mm-hmm good at figuring out given Yann’s direction what should we specifically do mm-hmm

so I think this pairing is interesting mm-hmm yeah, I feel like Yann also has this kind of very outspoken internet celebrity vibe [laughs] [laughs] very outspoken person right? and you’re relatively more low-key? mm-hmm, mm-hmm yeah, I think that’s relatively true

mm-hmm I like speaking through work mm-hmm okay, so then you co-founded this company together mm-hmm and then you’re in New York right? let’s talk about New York mm-hmm why not Silicon Valley? ah, this question this is indeed a question a lot of people are very

curious about right? uh I think first of all honestly I’m a New York person myself I’ve been at NYU for many years mm-hmm and Yann has been at NYU even longer than me right? and the feeling of New York, speaking truthfully is very different from San Francisco

mm-hmm I’ve been to San Francisco many times and I’ve lived in the Bay Area mm-hmm but the Bay Area atmosphere is really a pure tech bubble mm-hmm but you know what it’s not necessarily a bad thing mm-hmm in that bubble everyone can be very focused on doing one thing

mm-hmm so the entire Bay Area culture is just about building companies, right? mm-hmm and New York is I think, a more real world mm-hmm this real world in New York has given me many inspirations right? and then many of the ideas around the product especially the kind of embodied AI products

or world model products I’ve imagined actually come from life in New York mm-hmm right? and then also in terms of recruiting I think many people in New York have a stronger desire to do something more fundamental mm-hmm right, because the Bay Area is actually quite saturated now

yes in terms of talent it is saturated but in terms of culture everyone is doing product, product, product mm-hmm right? so I also feel that for what I’m doing New York might be a better fit mm-hmm mm-hmm, yeah right, as we talked about earlier

there are actually many AI startups in New York and there’s quite a vibrant AI scene in New York right? but New York still doesn’t have an absolutely top-tier AI company like OpenAI-level right? I think that is also an opportunity mm-hmm

right, Hugging Face is in New York mm-hmm mm-hmm, well Hugging Face is headquartered in New York but their team might be quite distributed but their HQ is New York so I think this is a very interesting trend mm-hmm okay, so then let’s talk about the current state of the company how many people do you have?

how’s it going so far? mm-hmm right, so we’re still very early the company is only about six months old or so mm-hmm and we currently have about 15 people mm-hmm the team is very very strong how big will your pre-training dataset be? ah, these things

that’s the research part right we actually now have a very good roadmap and we’ve also hired many many people everyone actually cares a lot about how to make something land in reality not just simply doing research although research is very very important and now if we want to achieve the goal of a truly good world model

how much compute does it need? mm-hmm I think compute is definitely needed but as I was saying earlier I think the compute efficiency will be very very different mm-hmm so the amount of compute might not be comparable to training a frontier LLM mm-hmm but one thing I think is very important

is the structure of how we use compute mm-hmm right, there are many ways to use compute for example you can use compute to train language or use compute to train video mm-hmm or you could train both simultaneously mm-hmm I think for our approach the distribution of compute might be very different

mm-hmm um a larger portion might be used on video mm-hmm but not just the kind of prediction-based purely the kind of prediction target, right? this approach mm-hmm but a combination of generative and discriminative methods and then with a combination of language too

right? mm-hmm so I think the goal is through the least amount of compute possible to train the best world model mm-hmm right? and then in doing so you also need to be able to make a product mm-hmm right? so it’ll be a long journey

mm-hmm but I think the path is relatively clear to me mm-hmm yeah right, well you did also mention Yann right, earlier you mentioned that before you started the company you were at NYU as a professor and also had a collaboration with Google right? you were in quite a good position

mm-hmm and then you made a decision to step out and do this mm-hmm what was the tipping point? or the final straw that made you decide okay, I’m going to do this mm-hmm I think it’s a combination of many things but I think the biggest factor was the conversation with Yann, as I mentioned

mm-hmm because I had never considered that Yann would want to do this right? mm-hmm and once Yann decided he wanted to do this mm-hmm the whole thing became a lot more compelling mm-hmm because I think with Yann doing this kind of thing is much more legitimate

right, meaning it’s not just two or three young researchers thinking they can change the world right? right, and Yann has the experience the vision and the prestige mm-hmm to attract talent attract investment right? so I think this is when I found out about this

I basically decided immediately mm-hmm without even thinking about it much right? I think this kind of opportunity is once in a lifetime right? mm-hmm, and also I’ve always said I actually really like Yann mm-hmm right? and I feel like having the chance to work closely

with someone like Yann is something very rare mm-hmm mm-hmm, so that’s also why you didn’t hesitate mm-hmm yeah alright, so last question mm-hmm if you had to send a message to the Chinese AI research community or students who are interested in AI research right?

what would you want to say to them? hmm I think there are a few things I want to say mm-hmm the first thing is about attitude mm-hmm I hope everyone can keep thinking for themselves mm-hmm don’t be swayed by trends mm-hmm I hope everyone can

think about what they really want to do mm-hmm and why they want to do it right? because I see many people in AI research and many people are doing it but actually sometimes it’s a bit following the crowd mm-hmm right, because it seems like this field is hot

mm-hmm so let me get into it mm-hmm but actually the more important thing is you yourself have a genuine passion for this kind of creative work mm-hmm you genuinely want to figure out the essence of intelligence right? mm-hmm if you just see this as a career path

that’s also fine right, if you just want a good job mm-hmm but I think for researchers or people who really want to push the frontier right? mm-hmm I think this genuine love for the work is really important mm-hmm the second thing is about approach mm-hmm

I hope everyone can think about problems more deeply mm-hmm right? I think a lot of current AI research is quite shallow mm-hmm meaning a lot of it is just following what others are doing mm-hmm right? people follow trends

mm-hmm but the most interesting things come from people who ask why? mm-hmm why does this work? mm-hmm why doesn’t that work? mm-hmm what is the essence here? mm-hmm and I think this kind of thinking deeply about a problem is a quality that’s becoming rarer

mm-hmm so I hope people can cultivate this quality mm-hmm and the third thing is about community mm-hmm I hope everyone can be more open to collaboration right? mm-hmm I think one of the beauties of the AI field is it’s a very open field mm-hmm

right, many papers are open a lot of code is open mm-hmm right? and this openness has driven a lot of progress mm-hmm I hope this spirit can be maintained mm-hmm yeah, thank you Saining mm-hmm this has been a very good conversation thank you thank you

mm-hmm okay so now let me introduce the next guest mm-hmm this next guest is also a very very special person mm-hmm he is a PhD student currently at NYU mm-hmm but he’s not your ordinary PhD student mm-hmm he’s also an entrepreneur

mm-hmm and then we just learned mm-hmm that he’s also Forbes 30 Under 30 wow yes this is very impressive mm-hmm let’s welcome mm-hmm Zhiyuan Zeng (Tommy) mm-hmm hi everyone hi hello

mm-hmm alright, Tommy why don’t you first introduce yourself mm-hmm sure, hi everyone I’m Tommy currently I’m a PhD student at NYU and my research direction is AI agents mm-hmm and at the same time I’m also the co-founder and CTO of a company

called Simular AI mm-hmm and the direction of this company is also AI agents mm-hmm specifically we are building a desktop AI agent mm-hmm the product is called S2 mm-hmm cool, desktop AI agent right? does it work on a computer? mm-hmm yes, it works on a computer

mm-hmm then I want to ask you what exactly does it do? mm-hmm right, so this thing basically can do everything you can do on a computer mm-hmm for example browsing the web mm-hmm writing code mm-hmm managing files mm-hmm using various applications

mm-hmm right, using various software mm-hmm right? mm-hmm so it can help you do tasks on the computer mm-hmm so it’s more like a full automation of computer tasks mm-hmm yes, it’s a computer automation tool right? and it can handle more complex tasks

mm-hmm right, like what? for example say I need to book a flight mm-hmm but this booking involves multiple steps mm-hmm like opening a browser going to a website searching for flights comparing prices mm-hmm and then ultimately booking it

right? mm-hmm all of these steps S2 can automatically complete for you mm-hmm so you just tell it what you want and then it does it for you right? mm-hmm yes mm-hmm, that’s pretty amazing mm-hmm right? then tell me what’s the difference between S2

and similar products out there? mm-hmm right, so I think S2’s biggest differentiation is mm-hmm reliability mm-hmm right? because right now many similar products might be able to demo well mm-hmm but in actual use the reliability is not so good

mm-hmm right? because computer tasks are inherently very complex mm-hmm there are many unexpected things that can go wrong mm-hmm right, like pop-up windows mm-hmm or maybe the website has changed its UI mm-hmm or maybe the network is slow

mm-hmm all sorts of situations mm-hmm right? and S2’s solution is we built a proprietary model specifically for computer tasks mm-hmm so that it can handle these complex situations mm-hmm right? and at the same time we also have a proprietary planning module

mm-hmm so that it can plan more efficiently mm-hmm right? mm-hmm, so it has a self-developed model mm-hmm right? mm-hmm a proprietary model mm-hmm so to do this you need a lot of data right? mm-hmm how do you get that data?

mm-hmm right, so data is indeed one of the biggest challenges mm-hmm right? so our approach is to build a data synthesis pipeline mm-hmm right? we use AI to generate data mm-hmm right? and then use this data to train the model

mm-hmm right? mm-hmm, and where does this synthetic data come from? mm-hmm right, so the synthetic data mainly comes from we have an environment mm-hmm this environment simulates various computer tasks mm-hmm and then we have an AI agent in this environment

completing these tasks mm-hmm and recording the process mm-hmm right? mm-hmm so this is the source of the data mm-hmm right? mm-hmm, that’s clever mm-hmm right? mm-hmm so then tell me who are your target users?
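The loop Tommy outlines, a simulated environment plus an agent whose episodes get recorded as training data, is a standard pattern. Here is a minimal hypothetical sketch; the CountdownEnv and ScriptedAgent stand-ins are mine for illustration, not Simular’s actual system.

```python
class CountdownEnv:
    """Toy stand-in for a simulated computer-task environment:
    the 'task' finishes after a fixed number of clicks."""
    def __init__(self, clicks_needed=3):
        self.clicks_needed = clicks_needed

    def reset(self):
        self.remaining = self.clicks_needed
        return {"remaining": self.remaining}

    def step(self, action):
        if action == "click":
            self.remaining -= 1
        # observation, done flag
        return {"remaining": self.remaining}, self.remaining == 0


class ScriptedAgent:
    """Stand-in for the policy acting in the environment."""
    def act(self, observation):
        return "click"


def collect_trajectory(env, agent, max_steps=20):
    """Roll out one episode and record (observation, action) pairs,
    the raw material for training a computer-use model."""
    obs = env.reset()
    steps = []
    for _ in range(max_steps):
        action = agent.act(obs)
        steps.append({"observation": obs, "action": action})
        obs, done = env.step(action)
        if done:
            break
    return steps
```

In a real pipeline the environment would be a full virtual desktop and the recorded trajectories would be filtered for success before training, but the synthesize-by-rollout structure is the same.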

mm-hmm right, our target users are mainly knowledge workers mm-hmm right? people who spend a lot of time on computers every day mm-hmm for example software engineers mm-hmm data analysts mm-hmm right, product managers

mm-hmm designers mm-hmm and so on mm-hmm right? but I think trying to accomplish something this different is still quite difficult because as I said I’ve been emphasizing all along we’re actually looking for a kind of balance this balance means

it’s neither a purely academic research lab nor is it one of today’s closed large-model companies Mm-hmm and this balance also means take me personally, for example it’s also a kind of balance it’s like I’m neither a very senior already accomplished and established kind of distinguished professor

but I’m also not an eighteen or nineteen year old who can just roll up their bedding and head to a factory in Shenzhen [laughter] and set down roots to do data collection or whatever I’m neither of those Mm-hmm some of the data comes from factories in Shenzhen Yes someone is doing it

the example I just mentioned is a specific company called build.ai I actually really admire that kid named Eddy he took a few people and dropped out of Columbia then went and lived in a Shenzhen factory Ah and then built a startup like that I think that’s so impressive

right I think this is both about finding balance but I find it challenging for myself but it’s also a new opportunity I think maybe maybe this era Uh might not belong to the old guard nor to the young guns but rather to a generation of mid-career entrepreneurs You said no to Ilya (SSI founder) twice

but said yes to LeCun Why is that? What kind of person is he in your eyes? oh right Yann is a fighter online right? actually firmly opposed to the LLM camp well, it’s not just opposing LLMs he actually doesn’t oppose LLMs he’s never said he opposes LLMs he’s very

he even says he uses Gemini himself he’s completely fine with LLMs he just opposes the narrative that LLMs can lead to human-level intelligence that’s the narrative he opposes that’s what he pushes back on Mm-hmm he has no objection to LLMs at all but anyway he’s a fighter online

constantly engaging in battles but I think privately he’s a really wonderful person he’s someone I genuinely admire and look up to from the heart Were you close before? we collaborated on some papers but definitely not like being in a startup together as co-founders like working closely like this

we hadn’t done that before Are you close with Kaiming? definitely not mm-hmm, right Yes but I think Yann is someone who truly has a reality distortion field I think he’s incredibly, incredibly impressive whenever I start to have doubts about something I always want to go have a chat with him

he can easily make the people around him at least that’s how I feel feel a sense of calm feel like, hey these challenges aren’t really challenges the road ahead is bright yes, he has that ability Mm-hmm and moreover of course his research vision I deeply admire as well

admire like many of what I just mentioned such as what a world model is why we need to filter information this is essentially also JEPA the core of the JEPA idea he proposed is that you can’t build a generative model that memorizes everything and reconstructs it all you need to work in an abstract representation space

to make predictions in an abstract representation space Mm-hmm that’s the core of JEPA but what I want to say is Yann, I think, really practices what he preaches he himself is pretty JEPA as a person he consistently holds fast to many of his own logical principles and the things he believes are right this
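The JEPA idea mentioned here, predicting in representation space rather than reconstructing raw input, can be captured in a few lines. This is a toy illustration only: the function names and the plain MSE objective are mine, while real JEPA variants (e.g. I-JEPA) use masked ViT encoders and an EMA target encoder.

```python
import numpy as np

def jepa_loss(encoder, predictor, x_context, x_target):
    """JEPA-style objective: predict the *embedding* of the target view
    from the embedding of the context view, never reconstructing the
    raw target. In a real trainer the target branch would be
    stop-gradient/EMA; here everything is plain NumPy."""
    s_ctx = encoder(x_context)   # abstract representation of context
    s_tgt = encoder(x_target)    # abstract representation of target
    pred = predictor(s_ctx)      # prediction stays in latent space
    return float(np.mean((pred - s_tgt) ** 2))
```

With a perfect predictor the loss is zero even though no pixel is ever reconstructed, which is exactly the filtering point being made: discard unpredictable detail, predict only the abstraction.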

is undisturbed by anything external but this doesn’t mean he’s completely stubborn who won’t listen to anyone that’s not really the case sometimes he’s been wrong sometimes he’s been right he’s right most of the time but he can actually take in what people say mm-hmm, and he also said

there was there was a press piece about how Yann can’t be moved that Yann LeCun can never be moved right, no one can move him Oh meaning he’s stubborn, right? saying he’s too stubborn Yann said I can absolutely be moved I can absolutely be moved but I need to be moved based on facts

not just because someone tells me what to do and I go do it that’s when I’ll be moved so back when he was at Meta actually Mm-hmm many people also told him we at Meta are now going to build Large Language Models we need to do all these things you can’t keep saying these things publicly anymore right?

you can’t go around constantly dissing Large Language Models as not working Yann couldn’t accept this at all Yann said my integrity as a scientist my integrity as a scientist cannot accept this so I think this is something I deeply admire too I think he truly the things he says Mm-hmm aren’t because something is now

trending and then he goes and says it everything can be traced back to its origins including his talk about world models he didn’t just start talking about it because world models became popular recently it was also something he was already talking about many, many years ago and he also has a really great paper I I genuinely recommend it to everyone around me

it’s called “A Path Towards Autonomous Machine Intelligence” right it’s his position paper also an opinion paper and at that point you’ll find there are many layers to his thinking these layers are presented in a very engineering-oriented and implementable or mathematically expressed form

so you see, when people ask him Yann, what exactly is a world model he never says something vague and high-level something relatively abstract and empty he’ll always write out formulas for you Uh he always will still does now still does now and he still spends one day a week at NYU

and still leads his own group he still holds group meetings during group meetings he walks up to the whiteboard and walks everyone through the equations step by step Mm-hmm highly technical very, very technical right What’s the division of responsibility between you two? Yann is executive chairman

so he’s more like the captain of our big ship about this with him I also talked with him about it who’s the captain he’s the captain no, I’m not talking about who’s the captain I don’t want to be the captain right, right, right, but he said on one hand he said

he really doesn’t like managing day-to-day operational matters he’s not a good CEO but on the other hand I feel — you’re not either right, I’m probably not either but I also think he’s a very wise manager he gave me this example he said his management philosophy is like sailing a boat

this by the way, that’s one of his hobbies I can talk about it later his other interesting things but he has this hobby he’s heading out in March to go sailing in the Caribbean again he says his management style is giving everyone enough trust to let them do what they’re supposed to do

but once some turbulence arises right? once we need to correct something he’ll promptly Uh as early as possible make that adjustment right? but before that trust everyone to do their work that is, believe in everyone to do what they’re best at yeah, I think that’s Yann’s role

he’s for this company on one hand a kind of spiritual leader but on the other hand also navigating the open sea you need a helmsman he also has this captain identity right and but I think what I feel about him I think what truly makes me feel I really enjoy working with this person

is more personal reasons we’ve talked a lot these decisions aren’t purely logical ones sometimes it still comes down to whether you click Mm-hmm it all comes down to people it all comes down to people right like Yann, even though he really is a big shot you’ll often see him at conferences holding out his phone

taking selfies with everyone taking group photos and privately he’s also a pretty pure and warm person right and being around him mainly I don’t feel any sense of fear even though he’s accomplished and distinguished mm-hmm, and then I won’t worry that I said something wrong and upset him

I think that’s actually quite rare especially given his status and standing to be like that and I can, or rather including everyone in this company can very directly tell him this is how I think about this I think you’re right, or I think you’re not right but let’s discuss together what way to move forward

that would be best for this company I think that’s also truly very rare right Tell us about your progress so far in terms of capital and team development of course by the time this is released it’ll be after your announcement uh, yes right, uh I think in terms of capital

Uh there’s no way around it my world model isn’t sufficient to support making that kind of prediction but our target might be around one billion dollars right if that turns out to be wrong we’ll just have to cut it [laughter] [laughter] [laughter] in terms of team composition

we’ll have many great partners like-minded people joining this company together so we’ll start with around 25 as an initial team mm-hmm, and we hope to gradually grow the team we don’t want to go too fast but not too slow either and in this there’s actually so much I think I think that’s part of the magic of building a startup

because before, at big companies I would also, uh refer some friends from the past my students to join the company together but it was never really a unified thing everyone went to different teams and did their own thing but but after starting a company I find you can truly bring everyone together

Oh and find a shared mission like this Mm-hmm I think that’s just so fascinating Mm-hmm and honestly I’m very moved by this myself because we have several friends who actually have tens of millions of dollars in unvested OpenAI stock if they were leaving OpenAI and also, say, at Google

there are also several like this Uh not at Google at Meta there are also those 15 to 20 million dollar offers like that and everyone just, without even thinking gave it all up to join us Why? I think maybe we’re all just a little crazy [laughs] it seems like

the thing is, you need to consider, on one side is research and on the other side is financial outcome right, of course I think if a startup ultimately succeeds the upside can be very significant mm-hmm, financially at least for now I think most people are still very mission driven right and everyone still believes

this is the only place where we can do this Have you already started thinking about business models? Uh I think the reason for raising this much money might be partly to reduce some of that pressure but of course this is a serious company so our CEO and COO spend a lot of energy every day thinking about

business model matters Mm-hmm right and, oh can I go back and talk about Yann again? Sure! oh right we’ll see how to adjust it later but I think what I just said this thing about having a compatible spirit is really not a commercial decision at all right, and then I think

mm-hmm, consistent with your mystical style of decision-making ah, of course of course the consideration is for example at the same time I would have had other opportunities too those opportunities might also have had much better short-term financial returns Mm-hmm higher salary, higher returns

but the way I’ve always thought about it is some people advised me go make money for two years first once you’ve made enough, come back and start a company, isn’t that better? Mm-hmm I partly agree, but I also worry right, at my current stage of life do I still have two years in a good enough mental state

to do this fully exploratory research Mm-hmm I think that’s hard to say it’s possible that once you have money your lifestyle will change [laughter] and then this might also cause you to lose some of that original courage Oh and I think this is just for me personally

I have many, many friends right now who are at Meta especially at Meta right, everyone is actually making a lot of money they’re also very competitive they work overtime every day too and basically everyone has moved near the office working overtime every day seventy or eighty hours a week Yeah

I think I also believe they will definitely build a great frontier model but I also want to say to them when you finish building that model mm-hmm, come check us out [laughter] I think yeah hopefully it’s not too late but I think everyone I know they all have this sense of mission right

Meta FAIR’s hiring strategy is it aligned with your hiring strategy? uh, definitely not we don’t have the money to hire like Meta FAIR does definitely different mm-hmm, right or like Thinking Machines (the frontier AI lab founded by former OpenAI CTO Mira Murati) including xAI I think they’re all very different right, I feel

although in terms of fundraising scale it’s actually pretty good right at least in the top few historically, right? top few — what’s the valuation? I don’t know, I don’t know Valuation we haven’t changed still 3 billion pre-money right [laughter] mm-hmm, but the money is actually not a lot

right, this capital is still very, very precious it’s not like being at Meta or at Google where you really have a money-printing machine and it’s okay, you can do whatever you want here you can’t just print money I think in a startup we still need to be very, very careful in how we deploy resources I think you deliberately chose not to start up in Silicon Valley

is that right? uh, yes I think Silicon Valley again it’s very complicated people often say that it’s already deeply mired in already hypnotized by Large Language Models [laughter] and I think I think Uh but I don’t think this state of affairs will last very long

people who are hypnotized will eventually wake up and I think at that point we we don’t rule out at all setting up a company in Silicon Valley I think in the end or maybe very soon our company’s location will definitely be wherever the talent is that’s where our company will be having an office that’s a perfectly normal thing

Mm-hmm right oh well, let me go back to Yann for a moment Sure. [laughter] no, what I want to say is I think Yann one thing that really appeals to me is he’s truly a multi-hyphenate or rather a quite artistic person or in Kaiming’s words Yann is someone whose adolescence at 16

has continued all the way to 65 oh, that’s wonderful oh I think I think he must be pretty happy but he often says with great pride he has four great hobbies the first hobby is building model airplanes the second is astrophotography so on Zoom you often see behind him there’s a nebula, right?

a nebula-like wallpaper desktop background which he actually photographed himself in his own backyard and his third interest is making electronic music and getting into some jazz and things like that mm-hmm and if you look at his webpage it’s a treasure I often go look at it from time to time

he talks about which jazz clubs in New York yes, the better jazz spots which musicians are particularly good and he also says that generally speaking French people look down on American popular culture except for jazz so he talks about Charlie Parker and a whole series of people and how great these musicians are

I find it so interesting mm-hmm and he has another hobby which is as I already mentioned sailing so I think a person like this appeals to me actually very, very much because I think his world is actually very big his world isn’t just limited to research and now we’re going to build world models I hope, you know

the helmsman of this big ship is someone with vision and a love of life [laughter] and there’s another very interesting example coming up in March maybe when this show airs we’ll have another paper to release the paper is called Solaris Solaris (from Stanisław Lem’s 1961 novel) this is actually a sci-fi novel

a novel by Lem, and later adapted into a film by Tarkovsky and the reason we chose this name is because we’re building a so-called video generation model and the film is also about an ocean this ocean that can read the subconscious memories of people and ultimately materialize and generate things from them

I think that’s really fascinating of course in Tarkovsky’s film the message is our greatest enemy is not some alien civilization or the unknowable the ocean is actually humanity itself it is humanity’s own suffering and memories so the ocean is just a projection of humanity onto itself

I want to bring this up because I think this film parallels what happens with LLMs so closely I think LLMs may not actually be understanding humans it’s just a projection of humanity just a reflection but what I want to say is in relation to Yann one day I said to him, hey this paper of ours what do you think of this name?

and I wanted to see if he knew the film and he said, oh you know this is a film title, right? I said yes that’s exactly why I chose this name he asked me which version did you watch? [laughter] the 1972 one or the one from the early 2000s? I felt I found the right person

was it the Tarkovsky one or the Soderbergh one, right? and I said, OK I think, mm-hmm I don’t just admire you for your research it seems you also know more than me about film mm-hmm I think that’s one thing quite interesting might not matter to many people but it’s quite important to me personally

a reflection of personal charisma a Chinese investor once told me all startups born with a silver spoon none of them have succeeded almost none what do you think? Uh I don’t know what silver spoon means here enormous fundraising I see very famous as a founder who is already accomplished

and very highly accomplished Mm-hmm ah, we weren’t born with a silver spoon as I said, we’re completely I won’t say a ragtag bunch it’s a grassroots coalition startup model how could Yann LeCun be grassroots? Yann is not grassroots but in the AI industry right now or on the internet

including in front of investors often it’s half support half opposition half support, half opposition I don’t know what the exact ratio is but in any case he’s not the kind of hero everyone rallies around he’s someone who holds firm to himself and always tries to do the next thing but that thing hasn’t been proven yet

like that mm-hmm, right? and I think this means we weren’t born with a silver spoon we don’t have a silver spoon we don’t have that feeling at all I think we’re an underdog we’re underdogs we actually are surviving under a kind of industry pressure a company like that right?

that’s so humble-bragging no, no there’s no humble-bragging we may have raised a lot but compared to the resources LLMs are mobilizing now this is just I don’t know what percentage, it’s so far off Was it difficult to raise funding? with Yann on board it really wasn’t difficult right

but I I think a seed round is just a seed round I think you have to look ahead right? I think you have to see what comes next which is to say can we ultimately deliver on our mission can we achieve this research breakthrough I think that’s the most critical thing for us

but anyway I feel I really enjoy this underdog identity especially as an entrepreneur because I think it’s the same as being a researcher the more you don’t believe in me the happier I am Have you felt anyone not believing in you since you started the company? mm-hmm, I think many people a lot of investor feedback

more disbelief or more belief? Uh I don’t know what the ratio is we have many, many people who believe in us we have many people who don’t mm-hmm, many of our or in Silicon Valley most people don’t believe us in the rest of the world most people believe us so putting it all together I don’t know

Uh but that’s okay I think the thing I most want to see is right? you can not believe in us but then let’s see right, well I’m all in on this path now are you with me? Mm-hmm How do you think entrepreneurship compares to being a researcher? What’s different?

I think there are many similarities but also many differences mm-hmm, I think about entrepreneurship… do you ski, Xiaojun? I don’t you don’t? I don’t like sports I couldn’t ski before either but I’ve been skiing recently and I’ve gotten quite a lot of insight from it I think

first, skiing is a sport about balance once you master the balance you can actually ski second, you have to be fearless and point your shoulders down the slope I think this is so counterintuitive people are always afraid when you’re facing the downhill slope you always want to lean back Mm-hmm counter-instinct

yes, you go against instinct and once you follow your instinct you fall backward and you completely lose control and completely fall right? only when you completely abandon you only with enough courage and not fearing anything and pointing your shoulders toward the slope you actually become more stable

right? and you can actually control your speed better so there’s a quote I really like right, this it might be from somewhere from JoJo’s the anime JoJo’s Bizarre Adventure — it says the hymn of humanity is the hymn of courage I think that’s also my understanding of entrepreneurship I think it requires courage

but what you just asked is it the same in academia? I think it requires even more courage but many of the decisions I made in academia mm-hmm, I think were also quite courageous decisions right? and there’s also this saying I think you never walk alone mm-hmm there’ll be many people helping you

Mm-hmm and precisely because you have people around you you become even braver Mm-hmm you just mentioned your taste in research what do you think about your taste in people? First of all I don’t think you should have a “taste” in people I think having a taste in people seems like a condescending way to put it

Yeah How would you describe your ability to read people? let me rephrase but I think it’s also a mutual process mm-hmm, I think again, I think there’s a kind of attraction that brings together people who can work together and we just need to follow that attraction to find those people and be with them

right I don’t think I would of course there will be some specific metrics we certainly have some like we’re conducting interviews now I can’t just say you don’t need to interview mm-hmm, I have a set of mystical logic for hiring that’s not realistic either Mm-hmm

but I do care about Yeah certain things I think I care about whether you truly have that kind of desire to solve a problem and the courage to want to understand something and that kind of persistence I think this matters for research and is also very important for entrepreneurship and when I recruit students

I also need to be able to see this kind of personality in people Mm-hmm [laughter] so this what does it actually mean? from the perspective of doing research it means if you have a problem in front of you right now Kaiming told me this too he said you should be thinking about the problem when you wake up

thinking about it while eating thinking about it in the shower maybe you can stop thinking while sleeping or maybe you even sleep with it on your mind do you truly have that kind of passion right? that drive to keep thinking about this problem or are you just treating this as just a job I think

I think it’s something that distinguishes people from one another a yardstick Do you have that problem right now? Yeah What kind of problem? mm-hmm, the kind of problem you carry with you every day yes, absolutely of course but my current issue is that’s also why I feel uh, in

after spending a long time in academia it gets a bit difficult because in academia, functioning you need to do all kinds of what we call context switching you need to switch contexts, right? because you have so many parts to manage and coordinate I think being in a startup is actually quite good I can now focus on one thing

I can think about what kind of team we should build what kind of people this team needs what problems we should solve in the next 1 month, 3 months, 6 months or a year Mm-hmm I might not be thinking about this correctly but that’s okay as long as the entire team works together we can fail together

pivot together then I think this company won’t fail I can’t guarantee every plan I have now is correct I don’t think Yann can guarantee that either Mm-hmm but I still believe in people as you said I still believe that gathering these people with ideals and passion who want to forge a new path together

will definitely achieve something remarkable Did you agree on the spot? LeCun? no, no, no there was a long, long gap in between and Yann wasn’t the first to approach me anyway, later Yann took charge of recruiting the team so he also had to think about what role each person should have right, I think later we discussed together

negotiated together and I think it was quite a long process and I think everyone eventually found their right place How long did you agonize over it? from the first time he told you maybe about a week of agonizing What were you agonizing over? whether I should start a company at all to do this

whether I should do this with Yann Mm-hmm or maybe look for some new opportunities mm-hmm, right? and then later but I didn’t agonize for very long right, I feel I thought, OK Yann used his magic I’ll tell you all talking to Yann is kind of like he’s a bit like

it’s like he’s casting spells like Harry Potter casting some enchantments on you mm-hmm, he says some things [laughter] and you stop thinking about other things mm-hmm, what spell did he cast on you? nothing, really he just shared his vision he just explained why this was a better choice

a better choice for me and also a better choice for this company why here I can have enough agency and autonomy the so-called ability to make independent decisions and build a team and help us design this entire execution roadmap I’m also incredibly, incredibly grateful that Yann could give me that trust

right but our company has several other co-founders everyone is really, really wonderful there are 6 co-founders in total oh, that many Yes and there’s a CEO what else? there’s a CEO right there’s also a COO there’s a COO right and there’s also

VP of world models and then there’s also someone whose current temporary title is CRIO who is also Chinese by the way, her name is Pascale Fung What kind of position is that? Uh it’s more of something between research between pure research and product a role at the alignment layer responsible for our innovation

she also has a lot of entrepreneurial experience Mm-hmm and our VP of world models was the original JEPA team’s uh, this so director, Mike and the COO was formerly Meta’s VP for all of Southern Europe Mm-hmm roughly that kind of combination so definitely not a purely researcher-background combination

Mm-hmm Will you explore consumer-facing products? uh, yes and the ultimate goal will definitely include a consumer-facing product but we hope we won’t be under any pressure because we still want to first build this world model however you define it first make it happen How many years out can your roadmap realistically plan?

planning years out is unrealistic I think if we can plan to a year that’s already pretty good right and I think we don’t need longer-term planning Mm-hmm Can greatness not be planned? uh, yes it’s just, I’m not it’s just like doing research I think you need an exploration process

start by exploring start doing things mm-hmm, then gradually find your own ideas I think this applies to startups too What do you think about where your ideas have progressed to? I think we’ve reached the point where we now have things to work on and we also feel there will be some quite promising results coming soon

that’s where we are but this thing what specifically? we can talk about it in a few months but coming back to it the thing is people outside have a misconception about this company and another misconception about Yann people actually don’t know what JEPA is mm-hmm, right [laughter]

I personally also went through several phases from doubting JEPA, to understanding JEPA then to becoming JEPA those three life stages Mm-hmm [laughter] I think this is also quite fun because at first, doubting JEPA was because we had just started doing self-supervised learning doing MoCo, doing MAE and I think

JEPA seemed like yet another self-supervised learning algorithm that’s it — then gradually understanding JEPA was because I felt JEPA actually goes deeper than we imagined there’s a lot of underlying logic inside it many mathematical principles and we also need someone on this path to keep persisting because what we discovered early on

couldn’t be scaled up so we stopped mm-hmm, and then but later with JEPA for example including me to give a simple example recently there was a paper called LeJEPA and with a very rigorous proof they showed if you want a good representation if you want this representation to be agnostic to your downstream task

then it must be an isotropic Gaussian distribution this is a bit technical essentially it means it’s a characterization of a certain property of representations and I found this actually has merit truly becoming JEPA is because I feel JEPA is not a model JEPA is not a specific algorithm JEPA is a complete cognitive architecture
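The “isotropic Gaussian” property he mentions can be made concrete with a tiny numerical sketch: a cloud of embeddings is isotropic when its covariance looks the same in every direction, i.e. roughly a scalar multiple of the identity. Everything below, the helper name and the toy data, is an illustrative assumption of mine, not LeJEPA’s actual criterion or proof.

```python
# Toy illustration of "isotropic": compare a spherical Gaussian embedding
# cloud with one that has been stretched along some axes. All names and
# data here are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)

def anisotropy(z):
    """Ratio of largest to smallest covariance eigenvalue (1.0 = perfectly isotropic)."""
    cov = np.cov(z, rowvar=False)
    eig = np.linalg.eigvalsh(cov)  # eigenvalues in ascending order
    return float(eig[-1] / eig[0])

n, d = 20000, 8
z_iso = rng.normal(size=(n, d))               # isotropic: covariance ~ identity
z_skew = z_iso * np.linspace(1.0, 5.0, d)     # axis-dependent scale: anisotropic

print(anisotropy(z_iso) < anisotropy(z_skew))  # → True
```

With 20,000 samples the unscaled cloud’s eigenvalue ratio sits near 1, while the stretched cloud’s ratio is roughly the spread of its per-axis variances, which is what the comparison detects.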

it’s a cognitive system this in Yann’s 2022 paper is what he wrote about so in my view, this cognitive system is a path to intelligence a universal intelligent agent’s in my current view a very reasonable path so what JEPA requires JEPA is not just self-supervised learning it needs world understanding capability

it needs the ability to understand the world it needs the ability to make predictions it needs the ability to do planning mm-hmm, right? prediction and planning right I think this gave me new insights into JEPA and I found that JEPA actually isn’t a specific as people outside tend to say like Yann has this method

and he must stick to this method and turn it into something specific it’s not like that JEPA is a very, very vast ocean in this ocean there can be many, many ships sailing on it sailing [laughter] ultimately this entire system will have a lot of collaboration and LLMs are also part of it
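The loop he’s describing, encode the context, encode the target, and compare a prediction to the target in the abstract representation space rather than reconstructing raw inputs, can be sketched in a few lines. This is a deliberately toy linear version just to show where the loss lives; the tanh “encoders”, the weight shapes, and every name here are my assumptions, not the actual JEPA architecture.

```python
# Minimal sketch of the JEPA idea: predict the *representation* of the
# target, never the raw target itself. Toy linear maps stand in for the
# real encoder/predictor networks.
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Toy encoder: a linear map into an abstract representation space."""
    return np.tanh(x @ W)

d_in, d_rep = 8, 4
W_ctx = rng.normal(size=(d_in, d_rep))    # context encoder weights
W_tgt = rng.normal(size=(d_in, d_rep))    # target encoder weights (an EMA copy in practice)
W_pred = rng.normal(size=(d_rep, d_rep))  # predictor weights

x_context = rng.normal(size=(1, d_in))    # observed part of the input
x_target = rng.normal(size=(1, d_in))     # masked/future part of the input

# Predict the target's representation from the context's representation.
z_ctx = encode(x_context, W_ctx)
z_pred = z_ctx @ W_pred
z_tgt = encode(x_target, W_tgt)

# The loss lives entirely in representation space, so pixel-level detail
# the target encoder abstracts away cannot dominate the objective.
loss = float(np.mean((z_pred - z_tgt) ** 2))
print(f"latent-space loss: {loss:.4f}")
```

The key design choice the transcript emphasizes is visible in the last two lines: nothing ever compares `x_target` to a reconstruction, only representations to representations.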

Mm-hmm so this makes me feel, mm-hmm this company can succeed and has a great chance of succeeding the reason is it’s not about shrinking things down under today’s LLM settings everyone is narrowing things down but Yann’s company is deliberately thinking big mm-hmm, he has enough space for us to explore to let us scale up

until the very end we can achieve some kind of new breakthrough when exactly will this happen will it happen we can’t predict but I feel this is a path I’m willing to invest my life in to walk How does it feel after starting the company? Your genuine feelings it’s gotten busier and more tiring

it’s gotten busier and more tiring of course, definitely mm-hmm, there are lots of ups and downs there’ll be a lot of tedious things but also because watching this company grow bit by bit watching some because we have 4 offices with so many legal issues whatever so much internal friction

slowly, what was originally internal friction gradually becomes smooth that process is actually quite enjoyable and in that process we also received help from many, many people so looking at it so far I think I made the right choice Mm-hmm maybe a bit different from your expectations? maybe more optimistic

Mm-hmm right, I feel the moment you jump, the fear disappears mm-hmm, right I think as long as you have courage everything else is manageable and I feel in this company Ah I can find that courage Mm-hmm You just said AGI is a false premise can you elaborate on that? AGI is a false premise

this is also something Yann often says didn’t he have a debate with Demis (DeepMind founder)? right, he asked what exactly is general intelligence does general intelligence actually exist? I won’t go into too much detail on this but his logic here is also very mathematical very Yann what he says basically comes down to it means

a person, for example, has about 2 million visual nerve fibers mm-hmm, this can be modeled the space of all possible visual functions is enormously vast as many as 2 to the power of 2 to the power of 2 million functions but what humans can actually process and perceive is vanishingly close to zero

right? we are limited by our consciousness we are limited by our own neural bandwidth limitations we cannot see everything that happens in this world Mm-hmm so human intelligence is a very specialized intelligence it can only humans can only perceive what they can see Mm-hmm
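The counting argument behind this rests on a standard fact: over n binary inputs there are 2^(2^n) distinct boolean functions, so the function space grows doubly exponentially in the number of nerve fibers. The sketch below checks that formula by brute-force enumeration at a scale tiny enough to count; the helper name is mine.

```python
# Verify 2**(2**n) by enumerating every truth table over n binary inputs.
from itertools import product

def count_boolean_functions(n):
    """Count all boolean functions of n binary inputs by enumeration."""
    inputs = list(product([0, 1], repeat=n))      # 2**n input patterns
    tables = product([0, 1], repeat=len(inputs))  # one output bit per pattern
    return sum(1 for _ in tables)

for n in range(1, 4):
    assert count_boolean_functions(n) == 2 ** (2 ** n)
print([2 ** (2 ** n) for n in range(1, 4)])  # → [4, 16, 256]
```

Plugging millions of fibers into the same formula gives a number vastly beyond what any organism could ever sample, which is the sense in which any real intelligence is “specialized”.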

and later I also added a tweet about it I read a book called “Are We Smart Enough to Know How Smart Animals Are?” which asks whether we’re smart enough to know how smart animals are Mm-hmm and after reading this book I let go of more of that human arrogance I think the evolution of intelligence is a continuous process

it’s not one where humans are truly unique right, we often say humans are intelligent because humans use tools but animals also use tools and some people say humans actually have a certain self-awareness and consciousness one laboratory said humans can look in a mirror and recognize that the person in the mirror

is themselves and not another entity can dogs? they can too right, many animals can oh, right? because some animals can’t dogs actually quite enjoy looking at themselves in mirrors [laughter] right anyway, many animals indeed can’t but many animals can mm-hmm, right?

and there are also many very interesting things like chimpanzees, right? and this author so de Waal also wrote another book called “Chimpanzee Politics” (a 1982 classic of animal behavior) which is about four chimpanzees and how they engage in power struggles very much like House of Cards or how there’s a lot of scheming

how you form alliances then maneuver and rise to the top and so on I think that’s very interesting [laughter] and one thing that left a deep impression on me was that for example, they these animals actually including chimpanzees, also have a kind of theory of mind they can also have their own world model

and their world models are actually quite good for example, there’s an example where an experimenter is in a room with two boxes one box containing a banana the other containing an apple the chimp is shown this then the boxes are closed [laughter] and the experimenter takes the chimp out after a long, long time

it’s brought back into the room and the first thing the chimp notices is that the experimenter is eating a banana and the chimp immediately goes straight to open the box with the apple and eats the apple without even glancing at the banana so chimpanzees also have a kind of reasoning ability

right? and although language is indeed unique language is something only humans have but that doesn’t mean other animals don’t communicate if we they have their own language they have their own language including like whales also have their own language anyway this is all quite fascinating I highly recommend that book

[laughter] and there’s also I read about some kind of bird (scrub jays) I forgot what they’re called apparently they’re very good at if one is burying food burying food underground if it notices that one of its peers saw it happen it will first bury it there then wait for the peer to leave, dig it up

and rebury it in a different spot I think that’s quite interesting and of course we also know dogs have a keen sense of smell and bats navigate by hearing I think the boundaries of intelligence are very broad people now talk about jagged intelligence so your world model which type of biological intelligence will it aim for first?

the goal is of course human intelligence human intelligence is certainly still at least in one dimension still the strongest or it’s also what can most benefit the world Mm-hmm so we still want to build a world model toward human-like intelligence Mm-hmm but I just want to let go of human arrogance

and recently I’ve been very inspired by this because I watched Rich Sutton in this podcast talk about a theory because before I didn’t know how to address this because people say LLMs are amazing, right? LLMs can now write code can win gold at the IMO and IOI can help us go to the moon and Mars

these things are incredible and I can’t deny these things they really are impressive right? but Rich Sutton’s reply I think was very good — he replied you think these things are great and impressive? that they’re hard? well, feel free to think that because I don’t think so I think building the intelligence of a squirrel

is the hard problem once you have a squirrel’s intelligence once you can build a squirrel’s intelligence and make it survive in the real world with its own goals its own objectives its own intrinsic rewards as you described it knows hunger it has its own emotions and it can engage in social activities

after that, writing code, going to Mars, going to the moon those things would be the easy ones Good I’m gradually coming to strongly agree with this view if you set aside human arrogance building a squirrel’s intelligence is actually a harder problem but that’s not how it looks to humans from a human’s perspective

it doesn’t seem that way but that’s entirely due to human arrogance you’re also building human-level intelligence ah, yes but what I mean is human intelligence has many, many aspects human intelligence is not just a language model human intelligence encompasses many types of intelligence that cannot be defined by language models or language itself

right, I think that’s a core insight What is your definition of intelligence? mm-hmm, so as I was just saying Rich Sutton talked about this he feels that squirrel intelligence is the real intelligence I think his framing is a bit different he’s not positioning from a human perspective looking at things from an anthropocentric view he’s standing at the universe

and the creator’s perspective from this angle of course being able to recreate a squirrel is greater than everything human civilization has created; against the 530 million years of animal evolution, all of civilization amounts to only the last 8 seconds in this sense I think that’s elevated the discussion I think that elevated perspective has merit

but defining intelligence I don’t want to give it a definition I think different animals have different intelligence and humans have human-level intelligence Mm-hmm and what I want to encourage everyone to do don’t only focus on what we as individuals cannot do pay attention to what we’re already doing well pay attention to what a 4-year-old child

or a child of a few years old does very well those things are actually what our world model next needs to focus on solving mm-hmm, so this is also why Robotics is ultimately a very fitting outlet because before you talk about AGI or super intelligence can we first have a sufficiently reliable and general robot

that can function in our home environment and help with household chores right, because a few-year-old child can actually do many, many household chores there’s actually a list you can search for it online a 12-year-old child can basically do all the household chores but is there a robot right now that can function like a 12-year-old child

and handle these chores? of course not Jie Tan from DeepMind Jie Tan he also says that robot development is extremely uneven extremely imbalanced its developmental trajectory is different from a child’s mm-hmm, for example the physical capabilities of robots’ limbs have now surpassed they’ve already surpassed humans

Mm-hmm but many other capabilities are still not as good as a child’s because of the brain nobody is building the brain nobody is building a robot brain all the robotics startups including the robotics divisions at big companies haven’t solved this Doesn’t DeepMind count? DeepMind is now entirely based on Gemini

so it’s also working within the VLA framework Yes everything converges to Gemini Oh but this needs a second half of pre-training Mm-hmm in Shunyu Yao’s classic formulation [laughter] I think there needs to be a second half but I think this is the second half of pre-training Mm-hmm Jim Fan recently also expressed the same view

so this pre-training is the world model who will do this pre-training? that’s not clear to me if I knew there was another place that could also do this then I might actually reconsider I wouldn’t necessarily need to be at this startup doing this myself right? robotics startups have no energy to do this

they need to put their resources into the so-called hardware scaling law that is you need to buy more robots to deploy these robots or do these things in simulators these imitation learning approaches can give you something good enough to solve specific problems in the short term, to be a robotics team that creates value

What about PI (Physical Intelligence)? VLA, right? PI is the same PI is already very, very research-oriented and doing very, very well and is inspiring as a company but again, they won’t do pre-training they won’t do pre-training they’ll take language models as their foundation

Yeah right? How should we understand your second half of pre-training? what goes in what comes out I don’t know at least the first step is in the long run the inputs are all continuous-space signals as I just described high-dimensional potentially noisy signals Mm-hmm

at first it might still be video but we might also have multi-modal encoders to handle different signals beyond visual and the outputs that’s a research question the self-supervised objective is still unknown, well, not necessarily unknown but it may become clearer later Mm-hmm

but I think it’s definitely not that simple but I think that’s where the excitement lies I also find it quite interesting because the first time we met you said “you are not the chosen one” “you are just the normal one” why do you like saying this? No you see, throughout our conversation we discussed my

growth story I didn’t expect we’d talk about all this but I definitely don’t feel like a chosen one [laughter] this quote is actually from a team I love Liverpool, right? I’ve been a Kopite (the Kop is the famous terrace at Anfield, a symbol of devoted Liverpool fans) for over 20 years [laughter]

I think there’s a bit of a kindred spirit there and my favorite manager Klopp Jürgen Klopp [laughter] was half-joking when he said to everyone, after another manager, José Mourinho, said “I am the special one” Klopp replied “I’m not the special one”

“I’m the normal one” and I think on one hand he himself is very punk he has that rock ‘n’ roll spirit [laughter] Uh and he often tells everyone that his role in the team is like a battery he hopes through his own passion and his own energy, you know to let others

generate electricity, to empower others mm-hmm, right I also want to be that kind of person I also want to be a battery for a team whether that team is in academia or in a startup I think this is actually not easy because sometimes everyone has their moments of discouragement

Mm-hmm I also sometimes want to complain more and let out my feelings but I’m gradually coming to feel in academia, like in front of students and in front of the startup team someone needs to play that battery role I think Yann is a giant battery he inspired me but I hope to pass this electrical charge through me

and send it further What was the last time you felt discouraged, and why? I feel discouraged every day I think it’s become a kind of researcher’s fate I think everyone has this underlying melancholy because the process of research inquiry is like groping around in a dark lightless place Mm-hmm when you can’t see any light

you always feel lost and discouraged and when people truly feel this kind of joy it’s only when you actually get something working but this part of the time is very, very brief maybe only 5% or 10% Kaiming has said something similar so over time right, eventually everyone’s mental state can become concerning

but I think it’s okay I think Uh I think this era now is still not quite the same as before I think now there’s more discussion I think this is one of the benefits of this AI wave at least people won’t feel like they’re in a closed space exploring alone at least people can scroll through Xiaohongshu

scroll through Weibo, Zhihu and see how everyone is discussing this I think that’s sometimes quite stress-relieving but sometimes it also adds pressure when people criticize you, it doesn’t feel so stress-relieving anymore Does your company have people with an entrepreneurial personality? an entrepreneurial personality generally quite optimistic I think Yann himself is very optimistic

very, very optimistic why isn’t he a researcher with that melancholy undercurrent? hmm, I don’t know because he’s been through hardship and then succeeded Oh he lived through the AI winter and then showed everyone he was right and they were wrong if I went through something like that

I might not be so melancholy either he’s still quite optimistic I think or rather, his past experiences have also given him more confidence and something he often says is this what happened before with deep learning neural networks is exactly the same which thing? it’s that now, world models

or whatever you call it the current systems building intelligent systems now he says there’s always a small group of people who can clearly see the trajectory of the world’s development the progress of technology but they’re only a small minority most people can’t see it right because most people are busy doing other things

back then with deep learning people were doing whatever other things traditional machine learning mm-hmm, and now what you’re doing is you can, mm-hmm let’s not say it — think about it [laughter] and I think he’s actually quite optimistic or rather he has enough confidence

and says the things I can see are important things the path I can see is a clear path and on this matter I still believe him quite a lot Have you ever doubted him? Uh as I said I questioned JEPA then understood JEPA then became JEPA so of course there was doubt

but I feel that trust in a person and trust in a research direction takes time I was just telling students the other day every time Yann gives a talk he gives exactly the same talk his slides are honestly pretty ugly [laughter] [laughter] but they have his personal style style and design

is also interesting some things are originally ugly but if you use them enough and time passes they become the new fashion but every time he gives that same talk I’ve been feeling this very, very strongly recently I said this talk I’ve watched it at least 10 times 20 times now, but each time I get something new

every time I feel like I understand a bit more what he really means and this deeper understanding is not because I’ve watched the same content 10 or 20 times and got this new understanding it’s because I’m doing what I want to do Mm-hmm and I find that when watching his talk

each time I do this translation work and association work I find that what he said in my current understanding can be interpreted this way and it doesn’t conflict at all with even today’s large language model or multimodal paradigms everything Yann says can be clearly mapped onto what we’re doing now

concretely and guide us to perhaps escape some local optimum [laughter] and perhaps lead to a different future mm-hmm, so it’s become an inspiration right? it’s not just knowledge it’s an inspiration Mm-hmm so I think that’s also wonderful Mm-hmm we just talked a lot about world models

do you have any new thoughts on your world model for the real world? In the past year or two I think this thing must definitely go beyond the limitations of research the limitations of being a researcher it must enter real life and understand what’s happening in the real world but I think New York is very different

I go to work every day first, I don’t have to drive so I’ve already started to step out of a kind of shell and enter real life by walking this I think has many wonderful effects for example some days I’m still under quite a lot of pressure sometimes something happens

and it’s quite discouraging but every time I walk through from my home to my office at school there’s a park called Washington Square Park Washington Square Park [laughter] inside there are all kinds of people all sorts everyone living their own lives there are street performers playing piano dancers

mothers pushing strollers old men playing chess and young people sitting on the steps doing nothing daydreaming and NYU students studying with laptops [laughter] I think my most stress-relieving moments every day are this roughly 5 to 10 minute walk I find the world is much bigger than we imagine not everyone cares about what AI is

they may not care about this at all and they have their own lives the world is big but on the other hand maybe AI someday in the future will indeed affect their lives so what should we actually be doing? as researchers do we have some kind of social responsibility? but this might be getting a bit far-reaching

but I just feel more contact with people more contact with people living in this world helps me understand what AI is and how to build the next generation of AI in new ways and this is exactly what Ilya wanted to talk about when he called me what he wanted to discuss but I hadn’t arrived at these insights yet

Have you picked up any new hobbies? New hobbies In New York? right no real new hobbies I think skiing counts as one most other times I genuinely don’t have time but the nice thing about New York is you know that once you go out you can find a new hobby that itself

is enough to make me happy whether or not I actually have time to step out and do those things Mm-hmm having that possibility available I think is quite different and very different from the Bay Area Can you share aside from work what music you like books you enjoy films and games you enjoy?

Right now Yeah that’s hard to think about off the top of my head I’m not sure I think let me approach this through AI let me think about what I’ve watched recently let me think Mm-hmm I actually enjoy watching TV shows so I can recommend some shows for everyone Mm-hmm

there’s a show called POI it’s also quite an old show Person of Interest I watched this many years ago in it they discuss what a super intelligence is you have a good super intelligence and a bad super intelligence their competition and the threat to human society and I think I won’t spoil it

but it’s quite multi-modal and this might have a certain prophetic quality I think it’s quite remarkable mm-hmm, right at its core it’s about how an AI in a box a language model or an agent that can write code step by step breaks free and becomes a multi-modal model

I think everyone should check it out and later there’s also something I really like like Pantheon (American animated series) it’s also I think a kind of AI prophecy yes, it’s an animation its author is Ken Liu (Chinese-American science fiction writer) he’s also from my hometown and he’s also someone who

worked as a lawyer worked as a programmer and ultimately became a novelist, which is incredibly impressive I admire him greatly and I love reading his books too right but this show was also recommended by Sam Altman before so many people have seen it and also recently of course there’s this very popular Companion

I think this is also a kind of AI prophecy the slightly troubling thing now is popular culture has been too saturated with AI making everything seem AI-related it’s a bit overwhelming but maybe it’s just because I’m an AI professional so sometimes it feels different but I think

these things are still quite inspiring including the sci-fi novels I mentioned including these older films I think they may all be a kind of prophecy about reality but generally speaking these works of film and TV don’t point toward a very bright future usually the endings are quite bleak Mm-hmm

ah, I recently watched a film I think it’s called No Other Choice a film by Park Chan-wook and it’s also about AI’s alienation of humanity throughout the entire film it never mentions anything about AI until the very end but the whole thing is about the changes brought about by AI’s arrival

what changes humans have undergone people’s mindsets relationships between people what exactly has changed I think these things are also instructive and speaking of one last word on films welcome everyone to come to New York in New York I used to attend one film festival the New York Film Festival

with many films to watch now I’ll be going to two the second one is the AI film festival Runway holds every year and I think it’s very cool and interesting if I were to recommend one very relevant to everything we just talked about one that won their grand prize this year an AI film called Total Pixel Space

[laughter] I won’t spoil it anyway this is a very interesting AI short film and it actually talks about a lot of what we just discussed about world models or why human intelligence is not simply or is not purely general intelligence

some arguments I think it’s quite fun mm-hmm, each of our guests recommends a life-changing book to our audience one that has truly influenced you and changed you what would yours be? a book? mm-hmm that’s hard — you have to let me think Mm-hmm one book I guess people often recommend

but the reason this book changed my life I wouldn’t say it changed my life hugely but it was during my undergraduate years a collective memory everyone would read this book called GEB have you heard of it? which is Gödel, Escher, Bach the Chinese title is “GEB: An Eternal Golden Braid” it talks about philosophy

mathematical logic and these three people, right? Gödel, Bach, and Escher, right? a mathematician, a composer, and a painter, mm-hmm what philosophical commonalities they share you could put it that way right and it’s very interesting

because during our undergraduate days the book is this thick we studied it together as a group it was also recommended by our teacher so everyone studied it together and actually back then nobody really understood it but later it started feeling more and more mm-hmm, like it makes sense Mm-hmm this book I think

if you don’t have time to read every page carefully you can read an abridged version or some kind of summary some of its ideas I find very, very interesting and also there’s a book this one was probably also read during undergrad called Zen and the Art of Motorcycle Maintenance or is it motorcycle repair

“Zen and the Art of Motorcycle Maintenance: An Inquiry into Values” I think it’s called that right and this book is also a process of inner seeking it’s about a person riding a motorcycle with this might be a spoiler an imagined philosopher but this philosopher is actually a projection of himself

mm-hmm, my feeling reading this book was I also didn’t fully understand what he was saying right, mm-hmm but some books and films fill you up and some books or films empty you out my feeling after finishing this book was it kind of emptied me out Oh~ and it made me feel Mm-hmm right, this gets abstract again

anyway, it made me feel Uh it made me sense what truly matters in this world what doesn’t for you what matters what doesn’t I don’t know I think I’m always looking for that balance I think, mm-hmm I think genuine communication between people is important

perhaps nothing else matters but at any given moment if you ask me this question I might say entrepreneurship is important research is important but at the end of the day I still believe that communication between people is what matters it sounds like you want to do research also for the sake of connection uh, yes

I think so and I think research itself is also a form of deeper connection Mm-hmm Mm-hmm this actually helped us during fundraising too how so? an investor was very willing to invest in us and his reason the reason was someone he knew, a very strong entrepreneur

who is also a researcher and this person said, hey you absolutely must invest in Saining and help him in whatever way we can but I had only met this person once at a meeting who was it? Uh, Robin Robin Rombach he’s the

first author of Stable Diffusion and the current CEO of Black Forest Labs Oh right Flux, right? [laughter] so the investor told me the reason he did this is this kind of trust is built on your academic work this trust can sometimes even surpass genuine personal

connection Oh people get to know you through your work and this carries forward and can go very far What do you think of Seedance? Seedance is incredibly impressive Seedance really is, our film crew today can also say something about it I think it’s extremely strong

[laughter] I’ve heard it’s also a very, very large model and it’s a MoE model I don’t know if this rumor is true because before this I know nobody had been able to make MoE work within a Diffusion Model architecture if they truly managed to do 200 billion parameters and with an MoE architecture

and they were able to ingest all that data I think that’s incredibly, incredibly impressive Mm-hmm but all these generative models 90% is still a data problem architecture doesn’t matter much 90%, or let me say 95% it’s all a data problem mm-hmm, their data is inherently more abundant

but volume alone isn’t enough Mm-hmm they must have done enormous work to clean the data to do captioning to calibrate the data distribution the balance of diversity and quality as well as the degree of prompt alignment with language I believe a large number of people must have been involved in this work

and done an enormous amount right but once you’ve done all these things well subsequent things become much simpler but I think I think Seedance is very impressive I think including Sora including Veo wanting to surpass them I don’t think it’s necessarily that simple

Our studio is called Language and World Studio what comes to mind when you hear that name? what are you thinking? I see you wrote me a line, uh, called “let go of Wittgenstein” well, that’s not a great way to end I’m going to start complaining again right, go ahead and complain so I say, let go of Wittgenstein

means people shouldn’t take Wittgenstein and really stretch him, taking his “the limits of my language mean the limits of my world” and using that quote as an endorsement for LLMs or linguistic determinism that’s completely absurd and likewise there are other quotes like people citing Feynman

Feynman said what I cannot create I do not understand this being used to endorse unified models I think both of these things are really unacceptable to me what’s the first thing? the first is Wittgenstein, right? when he spoke of the limits of language as the limits of my world there were strong preconditions

in his Tractatus Logico-Philosophicus what he discussed in the Tractatus was that the language he referred to targets what can be captured in propositions, the limits of the world that can be described and this does not represent the entirety of what we call the world [laughter] so first, the language he spoke of

and the world he spoke of are already different from the language in today’s LLMs and the world it refers to second, in his later period Wittgenstein had completely overturned his earlier entire philosophical system he later stopped saying that and what he talked about instead was language is actually a game the so-called concept of language games

meaning language itself has no inherent meaning these symbols themselves have no meaning the reason they acquire meaning is because they are connected to real-world practice and engaged with it Mm-hmm and this is very much the world model view that is we’re not saying that language can perfectly represent the entire world

what we’re saying is that the world’s practice the world’s actions determine the game of language its intension and extension mm-hmm, again I don’t understand philosophy I don’t understand Wittgenstein either but I just don’t like seeing in people’s papers opening with a pulled quote I think that doesn’t fit my aesthetic sensibilities

the Feynman quote is the same mm-hmm, he said what I cannot create I do not understand that quote itself is not wrong but the create and understand he’s referring to mean for example, we have a world we want to understand this world we want to transform this world we want to understand the world through transforming it

whatever the things he was talking about are still within a real, concrete world requiring some kind of action mm-hmm, even when you’re in class you go and make a PowerPoint you’re still engaged in a process of creation but now many people take this quote and use it to make this kind of, uh endorsement for some simple unified system

that’s logically untenable too we can’t simply reduce creation to a diffusion model and its backpropagation loss that’s completely absurd mm-hmm, right? so I don’t know I think maybe it’s like when I was a kid overusing famous quotes in essays now seeing these things gives me a bit of PTSD

and I think as Kaiming said everyone should read more philosophy I think that’s quite worthwhile mm-hmm, at the very start you said you believe in fate and believe in it more and more where do you feel fate is pushing you now? Ah I don’t know is fate pushing me? it doesn’t seem like it I think

there’s no feeling of being pushed by fate mm-hmm, just mm-hmm, when the next time I need to make a choice comes I just hope for good fortune Is this world a giant world model? of course the world is a giant world model can you predict fate then? uh, I don’t think so why not? Mm-hmm because we don’t have enough resources

Oh you’d need a computer as large as the Earth or you’d need a computer the size of the entire universe to tell you the answer about life about the universe about anything and the answer might ultimately be 42