#73 LLM Hunger Games: The Ultimate Showdown - Rootsconf recap (Part 3) Artwork

DataTopics: All Things Data, AI & Tech

Welcome to the cozy corner of the tech world where ones and zeros mingle with casual chit-chat. Datatopics is your go-to spot for relaxed discussions around tech, news, data, and society.

Dive into conversations that should flow as smoothly as your morning coffee (but don't), where industry insights meet laid-back banter. Whether you're a data aficionado or just someone curious about the digital age, pull up a chair, relax, and let's get into the heart of data, unplugged style!

All Episodes

DataTopics: All Things Data, AI & Tech

#73 LLM Hunger Games: The Ultimate Showdown - Rootsconf recap (Part 3)

December 26, 2024 • DataTopics

Send us a text

Welcome to the cozy corner of the tech world where ones and zeros mingle with casual chit-chat. Datatopics Unplugged is your go-to spot for relaxed discussions around tech, news, data, and society.

Dive into conversations that flow as smoothly as your morning coffee (but don't), where industry insights meet laid-back banter. Whether you're a data aficionado or just someone curious about the digital age, pull up a chair, relax, and let's get into the heart of data, unplugged style!

In this episode, we wrap up the Rootsconf mini-series with a thrilling finale with Sophie De Coppel and Warre Dreesen's workshop from our internal knowledge-sharing event:

AI Hunger Games: A showdown between AI language models like GPT-4, Claude, and Gemini. Who aced coding, games, and social interactions?
Human vs. Machine: Fun experiments like “Find the Human” and “The Chameleon Game” highlight where humans and AI shine—and stumble.
Model Personalities Explored: Discover why some models seem nerdy, others boastful, and how creativity plays a role in performance.
Engineering Insights: Behind-the-scenes on implementing and testing AI models in competitive scenarios, from advent-of-code puzzles to group chat debates.

Join the fun as hosts and guests break down the playful and thought-provoking ways we’re pushing AI to its limits. Let the games begin!

Speaker 1: 0:02

You have taste in a way that's meaningful to software people.

Speaker 2: 0:07

Hello, I'm Bill Gates.

Speaker 3: 0:13

I would recommend TypeScript. Yeah, it writes a lot of code for me and usually it's slightly wrong.

Speaker 1: 0:20

I'm reminded incidentally of Rust here, rust, this almost makes me happy that I didn't become a supermodel. Cooper and Netties.

Speaker 4: 0:31

Well, I'm sorry guys, I don't know what's going on.

Speaker 1: 0:34

Thank you for the opportunity to speak to you today about large neural networks. It's really an honor to be here. Rust Data topics. Welcome to the data. Welcome to the data topics podcast. Rust Rust. Rust Data Topics. Welcome to the Data Topics Welcome to the. Data Topics podcast.

Speaker 5: 0:46

Hello and welcome to Data Topics Unplugged Deep Dive, your casual corner of the web where we discuss all about the Hunger Games of AI. My name is Murillo. I'll be hosting this intro together with Bart. Hi hey, bart, I cut you off just before we started. Sorry, what were you? You wanted to say something.

Speaker 3: 1:07

That it would be very cool if we would have this, that thing that they have in the movies, to start a scene like the flappy thing that does the clap Ah of course. The numbers and the titles on there. I think that would be a nice prop to have. It would make you feel just a little bit more important.

Speaker 5: 1:24

Well, speak for yourself.

Speaker 3: 1:26

No, I'm just kidding, but I'm just the co-host right To make you feel more like an actual actor.

Speaker 5: 1:33

Let's make it happen. Bart, this is actually for people listening. This is going to be, I think, christmas will just have happened or will just be about to happen. I also checked, so Merry Christmas to everyone. Maybe this can be our Data Topics Christmas gift, the clapping thing.

Speaker 3: 1:53

It could be. I'll write to Santa. Let's see. It's time for the third mini episode today of our RootsConf interviews. Rootsconf is our annual knowledge sharing event, an internal event where we have a lot of our colleagues presenting ideas, projects that they did, the research that they did on a lot of different interesting domains. It's presented through talks, sometimes by a single person, sometimes by multiple people. What Murilo did is that, after these talks, that he dragged some people into the podcast room with him and then we've released these mini interviews a week at a time, and this week will be the third and final one. And what is this one about, murilo?

Speaker 5: 2:43

But before because I noticed that I haven't mentioned that I also presented one, and I is this one about marilo, but before because I noticed that I haven't mentioned that I also presented one, and I just wanted to, you know, share a bit of what I did okay, okay, okay, go ahead, go ahead um strike ups on armor yeah, yeah, I don't know, I just I just noticed that I feel like I'm talking about all these people.

Speaker 5: 2:59

But I also wanted to to share a bit. I thought I had a lot of fun delivery. So it was a workshop. You were there as well, bart. I joined the workshop. Yeah, maybe you can give me your feedback recorded in a bit, but I had a lot of fun building the workshop.

Speaker 5: 3:14

Delivering the workshop was very fun as well, and the idea was to kind of come up with two parts One well, each person basically gets a chat, gpt, and then you have to protect a password that is given by the system prompt. So the first part is like you have, you have to yeah, you can test stuff and you build your defenses. There's also like some programming that you can add to it if you want. And then the second part is that people try to capture each other's passwords, and I think the idea is also to bring a bit the experience experience like, okay, how reliable are these models, how reliable are not these models, what are some things? We can defend it, what is not, and have a bit of a competition healthy competition. I had a lot of fun building it. I learned a lot of stuff in building it as well, and I also had a lot of fun delivering. I don't know what you thought about it, bart.

Speaker 3: 4:01

It was really cool. It you thought about it part it was really cool. It was a bit of a. It was in teams to attack all our teams and you had a bit this uh, we were all in the same room, so that altered a bit to the effect um and uh, it was a bit of a gamification around jailbreaking, right. Yeah, it really gave people and a very intuitive feeling on what is jailbreaking uh, and, at the same time, really actively trying it out. Yeah, really cool, I set it up yeah, it was.

Speaker 5: 4:28

Uh, yeah, it was cool it was. I feel like he went by really fast. I wish I had more time, yeah, thinking then I still ran out of time. But uh, I think I usually run out of time, but that's not what we're here to talk about. What we're here to talk about is the gen? Ai showdown. So actually it was called the hunger games or ai hungry games or something by sophie, the couple and yes um, so what was their talk?

Speaker 5: 4:55

basically they, they had a, they had some games, basically, and they took the big lms I think gem, the Anthropic one, which I think is Cloud, that they use Cloud, sonnet and ChagPT. I think that was it. I don't know if that was the fourth one and basically they had some different games around it. So, for example, one is that they had all the models play the Advent of Code, which for people that don't know, is basically Christmas-themed coding challenges and they see what's the model that went the furthest there. They also had one that was like uh, there's a game called, I think, mr white. I want to say that, um, basically, each person gets a word and then one person gets a similar word or like a blank word, and then every person describes the like, gives one adjective above that word, but the person that doesn't know needs to make it up right. So they did something like this with LLM. So each LLM had like a turn to, yeah, describe it, and then, after a round or five rounds or something, everyone needs to vote who they think Mr White is. So they did it also with LLM, indeed, they also did one was find a human, also with LLM, indeed, they also did. One was find a human. So basically they had questions like again, like around the table kind of thing, and then we had one volunteer from that was watching the session to try to trick, you know, try to give a very chat, gpt like answer, and then everyone votes who they think the human is.

Speaker 5: 6:21

So a lot of like little fun games like that, you know. Um, that kind of highlight the different components. So, for example, the entropic models, what I also hear, and also my experience, but also what I see in blog posts and whatnot, that the entropic models are the best ones for programming today or the ones that look like they have the best results. Um, this was also the model that went the furthest on the advent of code, but it didn't do better on the other ones, right? So also it was a bit funny because talking to them it felt like some models, they had a bit of a personality of it.

Speaker 1: 6:57

Like.

Speaker 5: 6:58

OpenAI was a bit more show-off, like it would really say like, oh yeah, because OpenAI models can do this, this and this um, so it was a, it was, it was, it was. It was very interesting to to hear their insights here and there.

Speaker 3: 7:11

So it's very cool, very cool talk as well, let's go and listen, let's do it all right, thanks everyone merry christmas, happy new year, enjoy the holidays.

Speaker 5: 7:25

You have taste, in a way that's meaningful to suffer Alrighty, and the roots come still Now, with Sophie. Sophie or Sophie.

Speaker 4: 7:36

Sophie.

Speaker 5: 7:37

Sophie, my bad, so I feel like I've been doing this wrong the whole day. Sophie and Wache. Yes, how would you say Wache, wache. You scratched the R a bit, no, but I feel like it's a. It's a Belgian thing, but like Belgians, they all scratch the r. There's something. Some of them don't, some of them like kind of roll it like war yeah, it depends if you have or Flemish or Wallonian ah, yeah, okay, and isn't there like also the scratch, the r like, or you can roll it?

Speaker 5: 8:04

but also I've noticed some people that kind of say the r like this or maybe probably also difference between dialects. Yeah, indeed okay, but it's good, good enough. Yeah, sure, we'll keep working on that. Um, thank you all for joining. I think this is uh. Is it both their first? No, I think, sophie, you're. Last year the roots conf also recorded a short snippet, or no?

Speaker 4: 8:28

Yeah, I did the voice cloning About voice cloning.

Speaker 5: 8:32

Indeed, I remember I was there, I was paying attention, cool. So welcome back, and this is your first time on the pod. Okay, welcome. Thank you. Maybe I know you were there before, sophie, but for the people that didn't hear that one or people that would like a refresher, you know, update Sophie 2.0,. Would you like to introduce yourself for the people that don't know you yet?

Speaker 4: 9:00

So I'm Sophie. I've been at DataRoute for like two and a half years already, I think. I started at DataRoute as an intern freshly out of university and now I'm a fully-fledged engineer.

Speaker 5: 9:13

Look at that, can we get the applause maybe, or maybe the harp, like a metamorphosis kind of thing. Okay, no, never mind, just we can imagine that it happened.

Speaker 4: 9:28

I mostly specialized in ml, started like computer vision, but then quickly went to gen ei.

Speaker 5: 9:34

Uh, all sorts of gen ei by choice or by need well, it started with like the whole dali images generated so you were computer vision, and then there was gen ei computer vision, and then you got got hooked there.

Speaker 4: 9:47

No, in my free time I do a lot of artist hobbies, so painting and everything. So the whole generating images was a big thing, especially around artists. So I got into it with a project, also surrounding artists, with the Prismax, and then I went into text and voice later on. So, like I did, like the full circle and uh, was your internship?

Speaker 5: 10:12

was the style transfer? No, yeah, also a bit artsy yeah right, it was computer vision also in ai yeah, before dali, but then dali came out and completely obliterated my internship.

Speaker 5: 10:23

So well, but it's fine, it was an internship. I also um. Sometimes this happens, like even with nlp. They are like people that were doing research, professors that spend years and then lms come and destroy everything. Or the same thing. I heard it will happen with the um deep learning for computer vision. When it came out I remember even the professor q11. He was explaining. He's like, yeah, and we spent so much time doing apnc, um and yeah, then deep learning comes and it blows everything out of the water. You know it's like now it's llms or not lms, but it wasn't that time. It's deep learning everywhere. So it happens. I also talked to um. Another sophie from uh yeah, was from space. I don't know if space is a company anymore. The people from Explosion.

Speaker 4: 11:07

I met her on the meetup.

Speaker 5: 11:09

On the meetup that you also presented, indeed, and I was also talking because she also has a research background, right, and I share this perspective with her that, like when you do research, you kind of bet in one technology.

Speaker 5: 11:21

And you kind of become a very expert on one thing, but then if you bet on the wrong horse, quote, unquote then yeah, like it's not, like it's, it's not like it's. You throw it everything on the trash, right, but it's like it's everything. This is what's up now, right, like you, yeah, you specialize in one tool and turns out it was another tool that that won everything. So it happens. So for you, it was just an internship for some people with like years and years of work and and I remember what she mentioned, I got lucky.

Speaker 5: 11:49

Yeah, I remember what she mentioned. It was like ah, but we have to believe that we move the needle a bit. Of research, right, like we contribute to all these things, right Like, maybe, yeah, maybe no, that's not the winner because someone spent the time to invest in it, right?

Speaker 4: 12:04

Yeah, but it's also not like especially for her, it doesn't completely disappear, like some techniques of nlp can just still be used in the llm context yeah, true, I'm also wondering, I mean, how much can you translate topics right, like?

Speaker 5: 12:18

I'm not sure, because I remember for computer vision it was really like the, the filters, which was more like manual right, and I think, yeah, you can always reuse some of it right? I also think that the skill of thinking critically about problem and I don't know for computer vision, for example, the knowledge of the different convolutions and how to extract features and to understand that the images are just matrices and understand this and understand that you can play with images. I'm sure that a lot of it still translates right, but I wonder how much.

Speaker 4: 12:55

But very cool any um fun facts, anything, anything, any life updates since then. Uh, I did like the voice cloning for the mall.

Speaker 5: 12:59

Then after, yes, which was fun appearance.

Speaker 4: 13:02

Yeah, yes, uh, I think famous now I already did it at roots golf, but I couldn't talk about it yet ah, okay, so I had to keep quiet. And that's also why I did the talk with santa, because, like, we worked on the mall so we had the experience, but we like wrapped it around, like, oh, this is just a fun research uh, it was like.

Speaker 5: 13:21

Oh no, it was just friday afternoon. I just didn't have anything to do, so just play cool. What is it? Maybe for people that haven't watched the show or people that are not familiar? What is the mall and what did you do?

Speaker 4: 13:32

So the mall is a sort of team play where you have like one saboteur that they do different challenges and one among them tries to sabotage the challenges but they don't know who it is and they have to guess and the person that guesses the right person wins. And there was one challenge where we created different voice clones of the different candidates and then we said, like the voice of the mole is the only real voice in there.

Speaker 4: 14:01

So, they had to distinguish, okay, which one is like most real. But of course, yeah, the technology was already quite a band advanced, so they couldn't really tell okay, so you got him.

Speaker 5: 14:12

Yeah, okay, cool, cool, cool, yeah, and I think if people can, people still find it online if they look for those, probably. Yeah, maybe if you can share, maybe you can put on the show notes as well, very cool, and now for you back back to you in the studio. Sorry, it's fine, what is fine?

Speaker 2: 14:32

I'll do my best. Yes, so I'm water. I also joined data woods like two years and some months ago, together with sophie you also metamorphosized to a data engineer now yeah, I was a data engineer for a bit more than a year and then I got thrown into gen ei and now I'm here so now I got addicted now, you're not kind of hi, I'm Marvin.

Speaker 5: 14:52

I'm a gen ai addict.

Speaker 2: 14:53

Hi, what's up I do problem engineering for a living yeah, I talk to machines for all day.

Speaker 5: 14:58

Yeah, basically okay, cool. Um, any fun facts? Any?

Speaker 2: 15:03

yeah, fun facts. This is not my first podcast. My first, my first gen ai project I did was immediately for like a podcast for the aws session yes which was like I was selling myself as like an ai expert, and then the last thing I said was it's really easy.

Speaker 5: 15:21

I only started like four weeks ago and they all gasped yeah, so yeah, but I think also for jenny, there's not, it's not like yeah, if you're, you cannot have five years of experience with you, right, so, but um, it's cool and um, what was the for the aw? So it was, it was a podcast, but it was also the twitch live stream.

Speaker 2: 15:40

It was like podcast setting twitch live stream.

Speaker 5: 15:42

Okay, um, so it was for rag chatbots on aws so maybe again for people that we did talk about Rack before on the podcast, but for people that forgot what it is. What is it?

Speaker 2: 15:51

So if you ask an LLM a question, it doesn't know things about your company. So you put the things of your company in a database and then, before you give it to the LLM, you first query the database and you put those answers to your question so that it can look from the articles in the database and give a precise answer, fine-tuned for your company.

Speaker 5: 16:11

I would say Okay, so you did a RAC chatbot for AWS using AWS infrastructure.

Speaker 2: 16:18

Yeah, it was with some articles of the elections of VRT.

Speaker 5: 16:22

Oh, okay, and maybe AWS. What is the state of Gen AI and AWS? Because I know that, well, openai seems to be the big player. Right, openai is a partnership. I don't know if they're partially owned by Microsoft, but there's definitely a tight link between the two. Even there's the Azure OpenAI service, right, so still a lot of well, I think that still a lot of people, when they think of Gen AI or companies, even if they are on AWS, they still have Azure accounts just to use the OpenAI stuff.

Speaker 2: 16:56

It's quite okay. I mean, aws is fully ingrained with, like the Clouder models which worked pretty well. You had, like AWS Petrog, I think it was called where you had access to all the different models, but in the end we just used the OpenAI APIs and that also worked from AWS.

Speaker 5: 17:15

Okay. So what did you use AWS for?

Speaker 2: 17:18

For the setup and the search service.

Speaker 5: 17:21

Oh, okay.

Speaker 2: 17:21

We used it for the documents and the regs and they had services to connect to OpenAI. They also had their services to connect to open to open ai. They also had their services to connect to claudia. So it was kind of the same but different.

Speaker 5: 17:32

I would say okay so then the infrastructure was like the vector database and all these other things were only the guys deploying in an alanda okay and you work. You had to do some live demos on the twitch uh, yes, did you have backups? No was no. Was it a screen recording that you just pretended you were moving the stuff?

Speaker 2: 17:51

You can share. It's okay, we can stop recording. It was just. It just worked. It was nicely written code, yeah.

Speaker 5: 17:59

I don't put bugs, it's fine, it just worked. You guys don't have the same experience. No, okay, just make it work. The curse of the demo. Okay, it's fine. Okay, very cool, very cool, very cool, very cool. And you're both here at the RootsConf. Are you enjoying the RootsConf? Yeah, of course. Yeah, okay, cool, but you're not here just enjoying the RootsConf. You also presented, you also share knowledge, your knowledge, with people.

Speaker 2: 18:24

No, yeah, sort of.

Speaker 5: 18:27

I like how he's like oh okay. Well, what did you do?

Speaker 4: 18:32

We just had a fun presentation about LLMs.

Speaker 5: 18:35

Yeah, the title was LLM Hunger Games. Oh wow, right. Yes, what is? Yeah, the title was LLM Hunger Games. Oh wow, right. What is it about?

Speaker 4: 18:45

so we went over, we did a bit of the basics of LLMs, but then we quickly went into different kinds of games we created to just evaluate them a bit and let them fight against each other. And we even had a game where we put in a human so that there was some interaction in there and the human also had to fight for their For their life. To win.

Speaker 5: 19:13

Are they here still?

Speaker 4: 19:16

Maybe Dorian is still recovering. Dorian was the one Dorian is a colleague of.

Speaker 5: 19:19

So Dorian was the one. Dorian is a colleague of ours, Maybe Wadir. What were the games?

Speaker 2: 19:26

Yeah, so we started. Basically, we need something for foundation models. What is it? Let's make some games out of them. The ones I worked on was the Advent of Code, which will start next week. So I was like, okay, we all use LLMs for coding, how well do they actually do? So I found some APIs for Advent of Code that you can import your own puzzle data. So I just copy-pasted my assignments into the LLM and it gave me a solution that automatically ran and it just said the answer was 5,000, whatever, and you just fill it in and see if it works. And did it work? One model got to like day five and the others failed before that.

Speaker 5: 20:05

So but uh, and why do they fail? Like it's just because you think? Why would you say that they fail? Is it just because their models are not good enough?

Speaker 2: 20:11

yeah, they gave the wrong answer, but I think the advent of code is pretty clever and like they knew this would happen, so they put their assignments in like very vague long text format. So they tried to persuade you a bit and put it into a story. So it's harder for llms, I think, to understand the assignment okay, okay, okay, interesting.

Speaker 5: 20:30

And which was the model that won? Maybe?

Speaker 2: 20:32

it was the cloud model, the latest one, and gpt for row was, like always, nicely formatted, but it just wrote wrong code.

Speaker 5: 20:40

Yeah, I also, so I'm actually using Cursor. These days, cursor is like a VS Code fork for AI stuff. They also have models you can choose from. I have a feeling and also this is the general opinion that Claude 3.5 Sonnet, so it's the latest Claude model. It is better than ChachPT 4.0. That's what I For coding but also it agrees with the results you find.

Speaker 2: 21:14

Results are everything but scientific. Well, it's empirical.

Speaker 5: 21:20

If you did 1, 1000 games and claude is on, wins 90, 90 of them. That's. That's research. Right, like this dude's like benchmarking. They just try how much stuff and see who's who's on top we didn't have time for so much. I did five games, but you presented this well, um, okay, cool. So then it was like each model they were trying, and then you had dorian as well.

Speaker 2: 21:42

That was also trying some things yeah, for that one we had like a group, group chat setting and then we are like just talk to each other and try to guess who the human is between us so we had a chat conversation with different llms and then dorian also in there, but disguised as a player, and the chatbots had to find the human among them, and Dorian had to disguise himself.

Speaker 5: 22:05

So this was a different game. Yeah, it's different. So one game was basically a competition to see who can go further in the Advent of Gold. This was just LEMS and Claude won, yeah. And then you had this one that was like all of them are in a chat, um. And then you had this one that was like all of them are in a chat room and then based only on the text, right.

Speaker 5: 22:23

So I guess the latency is not a thing no just based on the chats, then, uh, they need to find who the human is. Yeah and uh, they could find them. Story yeah, yeah, they found dorian yeah very easily well not all of them which models did you use, by the way?

Speaker 4: 22:43

OpenAI, Gemini and Cloud. Openai 4.0 Cloud 3.5, sonnet and Gemini not the latest one, but the one before.

Speaker 2: 22:56

I forgot the name the latest one came out last week and how does it actually work?

Speaker 5: 23:00

one before, I forgot the name okay and the latest one came out like last week yes, okay and uh, okay and then. So how does it actually works? Like so, they are in a chat room, so each, each bot has a turn, or how does it work?

Speaker 4: 23:09

yeah it's. Each bot has a turn and you have like a sort of uh, general prompt that this guy that describes the game and it's visible for all LLMs, and then you have LLM-specific prompts on how their tactic should be. And then there is the group chat where you have all the messages that get sent to all the different LLMs.

Speaker 5: 23:32

Okay, and then how does the guessing happen? At any point or after five messages, everyone says okay, now you vote. Who's the?

Speaker 4: 23:41

Well, you can implement it in different ways, but how we did it is like five rounds of talking and then one round of voting.

Speaker 5: 23:49

Okay, and then that person gets kicked out. It's almost like the werewolf game A bit yeah, what's the goal?

Speaker 2: 23:56

a bit Something like that.

Speaker 5: 23:57

Okay, cool, and you developed this. What was the UI or something? Is it like a streamed app or is it a?

Speaker 4: 24:04

Well, we just ran it in Terminal. Oh good, so not super fancy. We built mainly upon an existing repo called Chat Arena, okay, but we had to implement some stuff ourselves, like the game, of course, but also like backends for Gemini.

Speaker 5: 24:23

I guess the reason I ask is if there's someone listening that wants to give it a try. Is there something that is possible or not yet, or not at all? I'm not going to work on this.

Speaker 4: 24:32

The general repo is available online or fork of it isn't, I think.

Speaker 5: 24:37

No, but maybe you can put the public repo as well in the show notes you need some work to get it working with your api keys.

Speaker 2: 24:45

Like we had azure open ai, so we had to have our azure account in there and that requires some work.

Speaker 5: 24:50

You have to know a bit what you're doing yeah, I see, I see, I see, and um how, how easy was it for the models to find those doors? It was like first voting. They all guessed. Which model did not guess?

Speaker 2: 25:00

doria the first, I think, open ai oh really I'm not sure oh, I don't remember.

Speaker 5: 25:08

I thought it was the second one, so claude claude, then yes so claude is like yeah claude is like a nerd that has no social interactions.

Speaker 2: 25:17

Uh, knowledge it was really weird, like Claude was really targeting Torian at one point it was like normally everyone asks like general questions and Claude was really, like you know, you, player four, what do you think of this? And then in the end he still voted wrong. So I'm like I thought he's really onto it, he's really noticing it, and then he didn't vote.

Speaker 4: 25:36

But the questions were also a bit like all over the place right, the temperature was pretty high, I think yeah, maybe I think I know what you mean.

Speaker 5: 25:44

But people you said like the temperature is really high, it's like, well, it's winter in belgium. What do you mean?

Speaker 2: 25:49

so when you have a model, you can give it a temperature that basically tells it how creative it is. So if you have it all the way at zero, it's almost deterministic. I would say it's always the same very dry, boring thing. And if you have it like all the way to the max, it's just, yeah, throwing stuff out, slightly coherent with it, but can get very creative if you have a very high temperature. I see, I see, I see. So for the spectacle we did put it very high.

Speaker 5: 26:15

I think um, maybe also so how many games you had? You said you found five, five, yep. I don't know if we have time to cover all the games, but like so you mentioned it was, I think, maybe also.

Speaker 4: 26:19

so how many?

Speaker 5: 26:19

games you had. You said you found five, five. Yeah, I don't know if we have time to cover all the games, but like, so you mentioned it was like the find, the human chat, the advent of code, which, by the way, for people that don't know what the advent of code is, if you're listening, this December there's a guy that he just puts one. There's like the advent calendar, so it's like one chocolate or whatever per day until Christmas. The advent of code is something similar. So every day there's like a puzzle and then, based on the puzzle, you also get like a text file or something. Then you have to manipulate the text file to get the answer. You submit the answer and then, as you do, you unlock other levels. Right, so that's what the advent of code is. So you can also have private leaderboards. It's quite a lot of fun. It's in the month of December. So, people, if you're listening now you can Google the admin code or we can put in the show notes as well. So coding challenges.

Speaker 2: 27:09

Chatbot. What were the other ones? I also had a look at an image, because models are multimodal these days. So what does it do with an image? We took one of the fun pictures of people wrestling in summer suits at one of our day with steam buildings. Okay, we asked them like describe this image and what does it tell you about company culture and stuff? Okay, but I would say the results were pretty disappointing. They were like all very dry, just explaining what happened and I was like expecting some funny answers or whatever. But okay, it was pretty. It was like hr, like the hr approved message that rodents like super exciting.

Speaker 5: 27:44

You should have asked the the twitter yeah, you know yeah okay, cool, uh, what?

Speaker 4: 27:52

so that's the third uh, we also just had like a simple group discussion where they could decide among themselves which is the better alarm oh okay, and did they agree, or something?

Speaker 5: 28:03

yeah, but it was because openai started, uh, promoting itself I like how, like these different models, they start to have personalities right they have the the clot is the nerdy one that doesn't talk to people, stays like in their cave. Very good programming but cannot tell anything about social cues. Then you have open. It's very cocky.

Speaker 4: 28:21

Yeah, okay, yeah it was actually funny because we put specifically into the prompt like don't vote for yourself, but opening. I immediately started to focus, just like.

Speaker 5: 28:34

Sorry, I cannot go against my nature okay, and what was the last one? Nature Okay, and what was the last one?

Speaker 4: 28:41

What was the other one?

Speaker 2: 28:42

I'm thinking.

Speaker 4: 28:43

And a chameleon.

Speaker 2: 28:44

It's basically the same, but without the human in the loop.

Speaker 4: 28:46

Yeah, a chameleon game, so one person.

Speaker 5: 28:48

One bot is the chameleon, and they have to find out who's the chameleon.

Speaker 4: 28:51

Well you have a topic that is secret, and each time they have to give a hint about the topic and the chameleon has to try to blend in, as it also would have known the word there's a game of Mr White or something, yeah it's the same cool and which was the?

Speaker 5: 29:10

what were the results there? Any insights? What was the best model? What was the most?

Speaker 2: 29:13

I would say Gemini was like really human, like, I think, in general for all the experiments it always sounded like the most human, but it was sounded like the most human but it was also like average at everything. So it was like a bit.

Speaker 5: 29:26

It was very human.

Speaker 2: 29:27

It didn't have any strengths, but it was like very natural. It didn't sound that much like a robot.

Speaker 5: 29:33

Cool and the presentation. You did this all live or no?

Speaker 4: 29:36

Yeah, we did live coding, but I did have video backups.

Speaker 5: 29:41

Okay, this all live or no? Well, yeah, we did live coding, but I did have video backups. Okay, cool, nice, smart. Yeah, yeah, um, but all these experiments you showed like live and all these things, okay, very cool, maybe also from this. Maybe last question to wrap everything up what would you say on the hung, these hunger games? Who was the actual big winner?

Speaker 2: 29:56

if you had to choose one I think for me claude surprised me the most because it was like just easy to work with and it was actually good at stuff. I was like expecting GPT is taking up everything everywhere, but it was like cloud was really good at coding and all the rest it was yeah, for me cloud was a big surprise okay, what about you, sophie?

Speaker 4: 30:18

I would say cloud or Gemini also, but I'm maybe a bit biased because I have a lot of experience with OpenAI. So you already recognize the way of talking and everything.

Speaker 5: 30:28

So it's not surprising anymore. Okay, but then so you said Claude or Gemini.

Speaker 4: 30:32

Yeah, gemini is more creative. I think, I think, yeah, and more out of the box. But yeah, very cool can also go wrong, of course asking all of them, or even people right you can always go wrong, all righty.

Speaker 5: 30:49

So thanks a lot. Thank you, yeah, I don't know if I've been doing this with other people, so I don't want to. I want you to feel left out, but uh, thanks a lot for joining. Uh, it will be really cool as well. Maybe we can even once we meet at the regular session, we have the light, the camera and everything. Maybe you can even show some things. Could be a lot of fun as well.

Speaker 4: 31:09

Okay, so I'm not deleting the code yet.

Speaker 5: 31:12

Don't delete the code yet, but it sounds a lot of fun, very cool. Thanks for joining me to chat today. Thanks for having us.

Speaker 1: 31:22

Thanks y'all you have taste in a way that's meaningful to software people hello, I'm bill gates I would.

Speaker 3: 31:36

I would recommend uh typescript. Yeah, it writes a lot of code for me and usually it's slightly wrong.

Speaker 1: 31:43

I'm reminded it's a rust here, rust. This almost makes me happy that I didn't become a supermodel. Huber and Netties.

Speaker 2: 31:54

Well, I'm sorry, guys, I don't know what's going on.

Speaker 1: 31:57

Thank you for the opportunity to speak to you today about large neural networks. It's really an honor to be here. Rust, rust, rust, rust. Data Topics. Welcome to the Data. Welcome to the Data Topics Podcast.

Ben Mellaerts

Host

Murilo Cunha

Host