DataTopics Unplugged
Welcome to the cozy corner of the tech world where ones and zeros mingle with casual chit-chat. Datatopics Unplugged is your go-to spot for relaxed discussions around tech, news, data, and society.
Dive into conversations that should flow as smoothly as your morning coffee (but don't), where industry insights meet laid-back banter. Whether you're a data aficionado or just someone curious about the digital age, pull up a chair, relax, and let's get into the heart of data, unplugged style!
#55 Can AI Predict Euro 2024 Winners?
In this episode, join us along with guests Vitale and David as we explore:
- Euro 2024 Predictions with AI: Using Snowflake's machine learning models for data-driven predictions and sharing our own predictions. Can animals predict wins better than ML models?
- Tech in football: From VAR to connected ball technology, is it all a good idea?
- Nvidia overtaking Apple and Microsoft as the biggest tech corporation? Discussing Nvidia's leap to surpass Apple and Microsoft, and the implications for the GPU market and AI development.
- Unity Catalog vs. Polaris: Comparing Unity+Delta with Polaris+Iceberg and their roles in data cataloging and management. Explore the details on GitHub Unity Catalog, YouTube, and insights on LinkedIn.
- Databricks Data and AI Summit recap: Discussing the biggest announcements from the summit, including Mosaic AI integration, serverless options, and the open-source Unity Catalog.
- Exploring BM25: Discussing the BM25 algorithm and its advancements over traditional TF-IDF for document classification.
you have taste in a way that's meaningful to software people hello, I'm bill gates.
Speaker 2:I would. I would recommend uh typescript. Yeah, it writes a lot of code for me and usually it's slightly wrong I'm reminded, incidentally, of rust this almost makes me happy that I didn't become a supermodel.
Speaker 3:Cooper and Nettix.
Speaker 1:Well, I'm sorry, guys, I don't know what's going on. Thank you for the opportunity to speak to you today about large neural networks.
Speaker 3:It's really an honor to be here, rust, rust, data Topics.
Speaker 1:Welcome to the Data Topics.
Speaker 2:Hello, and welcome to the Data Topics podcast. LinkedIn, X, Twitch, you name it, we're there. Check us out. Feel free to join us and leave a comment. Today is the 20th of June, 2024. I'll be hosting you today. My name is Murilo, and I'm joined by the man, the ambassador of MLflow, the Italian. Are you Italian? Hello. And we're also joined by the top scorer of Data Roots, and when he's not scoring goals, he's engineering data. Correct? David, welcome. Thank you, thank you.
Speaker 2:Maybe, Vitale, you've been... Well, actually, the first notable mention here is Bart. Where is he? I know, Bart's not here. You can see I'm seated in his famous chair. See, he looked away for one minute and I took over. That's what we do, Brazilians. Bart cannot be here, unfortunately. I'm sure that he's following the live stream. He must, you know, he must. So shout out to Bart. Feel free to leave a comment here, Bart, so you can be with us. I'll be waiting on that. Apparently, hold on, there is an issue. Let's see, let's try this. Hello, hello. All right, apparently there's an issue with the live stream, trying to see if we have audio back. This is also because Bart is not here today. Ah yeah, therefore the setup is a bit different. I was saying you're already doing better than Bart.
Speaker 2:No, you're not. I think you're a bit too quick, but let's see if we can get these issues fixed. In the meanwhile, the podcast is recording, so, worst case, we'll just re-upload the video later. Vitale, you've been here before. You're a friend of the show already, super fan of the show. You still decided to come back, surprisingly. Well, what do you want to tell people about yourself, for people that haven't heard or followed the first one? Any news, any life updates?
Speaker 1:Yeah Well, for everybody that didn't follow the first, sorry two times right. This is my third time here at the podcast.
Speaker 2:Yeah, I think so.
Speaker 1:I'm Vitale, nice to meet you, everyone. I'm a machine learning engineer, currently a tech lead for the AI business unit, together with Murilo, at Data Roots, and I'm a big, big football supporter. So tonight let's hope that Italy will win against Spain. You're wearing your colors.
Speaker 2:I see. Yes, yes. Thanks. And David, what about yourself, for people that haven't heard about you yet?
Speaker 3:Yes, this is my first time here, so nice to meet you, everyone. I'm David, data and cloud engineer at Data Roots, and I'm a supporter of Belgium, so let's hope... Where's your T-shirt? Yeah, I didn't know you were going to be wearing them.
Speaker 2:We ganged up on him, but you said now a supporter of Belgium, I am a supporter. Okay, I thought you said like now I'm a supporter of Belgium. But I was gonna ask what were you supporter before?
Speaker 3:no, no, I was supporter of Belgium. I was gonna say, let's hope they do better. Uh, let's hope Italy is better than Belgium.
Speaker 2:Yeah, that was uh painful. I was talking to some Belgian people. They're like, yeah, I won't even watch the first match, I'll just start watching after the second, because the first one's gonna be easy peasy really. Yeah, I heard that. So you know, I think we all tell this maybe they did it on purpose to give the attention, exactly, exactly.
Speaker 2:But yeah, so, as you mentioned, we are wearing the shirts. I'm wearing the Brazil shirt as well. For anyone following the live stream, you can see me there. Vital is wearing your shirt. Dave, you're not wearing. You also mentioned the first match, right? Why are you bringing this up? What's this about? What's happening right now? Why are you so patriotic all of a sudden?
Speaker 2:A lot of big tournaments, true, true. Namely, the Euro just started, right? Yeah, the Euro is the European tournament, once every four years. Copa América, the one Brazil participates in, is also happening right now. And there's always a lot of anticipation, I feel, right? Who's gonna win, this and that. I think football is also a nice sport because it's rather unpredictable. But of course, us being data people, trying to make predictions, trying to fit models, it's natural that every time a big tournament comes, you always see these predictions, right? And I think, David, you shared this one, right? This came across our radar. So, for people not following the live stream, it has a guy in a Snowflake T-shirt playing football, I guess a football player, and it says "Predicting Euro 2024 with Snowflake ML". What is this about, David?
Speaker 3:So yeah, this is someone who wrote an article about the European Championship where they predicted the outcome of the whole tournament. Basically, I have to mention the author is British, so it has to be taken with a grain of salt, I guess. No, no, but they took some relevant data from Kaggle, from all the international games from somewhere in the 60s, I guess it was, and then they just ran some machine learning models that they trained on Snowflake to predict the outcome of the tournament. The outcome, I think, was often France, England or Portugal, which are also the biggest favorites to win. Apparently not Spain. No, I mean, in most predictions I guess Spain is one of the top teams to win it, but here he didn't get that. Apparently, he said England will win, but yeah, of course.
Speaker 2:Who do you think is going to win, aside from all the data and stuff?
Speaker 3:I mean, I'm still going to believe that Belgium is going to be up there, for sure. You truly believe that?
Speaker 3:yeah of course, how much money would you put on it like that's everyone can say it right like um, of course, of course, I hope it, of course. Uh, I think I think we can still progress pretty far in the tournament. But if I really would put money on a team, I think it would be England. I think they have by far one of the best squads out there For those people, do you think England will? If I would have put money on someone, it would be England. Yeah, Wow.
Speaker 3:Not patriotic at all, huh? Just betraying your country right there. I'm still gonna support them, but if you really have to put something on it... Okay, and what about you, Vitale?
Speaker 2:what do you think?
Speaker 1:So, I thought it starts like that. No, well, of course, I will support Italy till the end.
Speaker 2:Four years ago... Actually, we are trying to defend our title of European champions. That's true, huh, I almost forgot, because you won the last Euro but then you didn't make it to the World Cup for some reason. It was super awkward, I even forgot. I feel like I haven't seen Italy in so long.
Speaker 1:But when it's about Europe, we switch our mentality. I see I saw a meme last day. Like Italy in Europe and Italy for the World Cup, Europe was a picture of Leonardo DiCaprio when he was at the top and the World Cup was nowadays. But I think France is going to win. France that's true, but they still have a very, very strong team that is true.
Speaker 2:So here, like in this prediction, so maybe to dive in a bit in this article, I think they are using Snowflake and Snowflake ML right For the compute needs. Let's say so, maybe for the people that don't know what Snowflake is. Anyone wants to give a quick intro?
Speaker 3:It's a data warehouse comparable to Databricks. Store your data, use their compute to process it scalable, serverless, no need to manage anything right, indeed.
Speaker 2:But Snowflake, over time, has been taking steps. I think in the beginning it was really just a data warehouse, and now I feel like they're taking more steps and really looking more like Databricks. They have notebooks, or actually they will have notebooks soon. They have Snowpark ML, which rhymes a lot with Spark ML, right? For no reason, I'm just saying. They also have a model registry there. I skimmed through it, I haven't looked as much into it, but it also seems nice how they're using the Snowflake platform to really compute the stuff and all these things there, right? You also see the code here, actually using the Snowpark ML code, which, it turns out, looks a lot like Spark as well, surprisingly. But yeah, really cool. Did you get into how they actually computed it?
Speaker 3:I also skimmed a bit through it. I think one thing I found pretty impressive was that he was really using only the ML models, and not the ones you already get by default on Snowflake, which are the Cortex functions you
Speaker 2:have.
Speaker 3:So you really train the models themselves. But I think it's somewhere that it only took a minute and a half to train like a thousand models or something.
Speaker 2:Oh, really, and then, like in training models, you're always using this Snowpark ML thing.
Speaker 3:Yeah, yeah, I think it's all in Snowpark ML.
Speaker 2:And the Cortex thing they're.
Speaker 3:They're like pre-trained models, I think so, yeah. You just have these functions to complete some stuff. It's like if you send a message to ChatGPT, let's say, and then you get some kind of response. So yeah, these are pre-trained. You don't have to do anything. It's like a plug-and-play model. But yeah, of course, maybe for a case like this you need more custom solutions, I guess.
Speaker 2:Yeah, indeed.
Speaker 1:And I have a question, david. What if, for example, instead of SnowparkML is the name of the library, right?
Speaker 2:SnowparkML. Yes.
Speaker 1:I would like to use, for example, TensorFlow or PyTorch or any other, let's say, framework. Is it already compatible? Can I do this in Snowflake?
Speaker 3:That I'm really not that sure of. I do think you can import these libraries into the Snowpark environment and use them there, as far as I know, I think maybe, for I'm not sure.
Speaker 2:I mean, I know that the the main challenge, I guess, in terms of putting this in practice, is that tensorflow is not set up for distributed computing right, and I think SparkML is. I would imagine that SnowparkML also is. So I'm not sure actually, but do you know how SparkML plays with TensorFlow and whatnot?
Speaker 1:Well, yes, in Databricks you can use any of the major frameworks. Of course, if you want to scale to multiple nodes, Spark ML is the easiest way to do that, because it's basically Spark with some additional models and features to perform model training and inference on a cluster of nodes. But sometimes, when you really want to, for example, fine-tune a model or, I don't know, perform transfer learning from a pre-trained TensorFlow model, you can simply spin up a single-node cluster in Databricks and use the normal Python library. For example, you can pip install it directly in your notebook and use it for, let's say, your model training.
Speaker 2:Yeah, I think for me too. I tried, years ago, the Snowpark ML... no, not Snowpark ML, sorry, just Snowpark Python. So Snowpark, actually, they have it in Java and Scala. They have some, I think, JavaScript UDFs and stuff. So when they announced the Python stuff, I actually gave it a try, and I remember it wasn't very smooth, actually, I think.
Speaker 2:Back then they didn't have notebook environments; it was more like you could create a Python transformation step from your SQL code, in between your SQL steps, or something like that. And I remember I had a bit of a hard time with the different dependencies in the environments. I remember I wanted to install something from pip, but they only had packages from conda. Locally I was using pip, but on conda the package had a different name, and sometimes that caused issues. I also wasn't sure what was running on my laptop and what was running there, or how you set up the dependencies on their side. I didn't know if they were distributing them. So I remember it was a bit clunky then. I know it's a bit challenging, but I know that they have spent a lot of effort in this direction.
Speaker 1:But they didn't properly check the combinations of the teams there. What do you mean? Because, for example, Belgium is playing against Croatia, Italy is still there and also Spain, and they are...
Speaker 3:The three of them are in the same group, but three countries can still progress from the same group. The best third-placed teams of all the groups, right? Yeah, so four out of the six third-placed teams can still progress. And how likely is that? Croatia?
Speaker 2:Yeah, yeah, Spain and Italy will. Maybe another big question here: they have Italy versus Belgium, and they have Belgium beating Italy. What? Yeah, that's what the data says. That's the data. That's not me, okay.
Speaker 1:You want your revenge after Okay.
Speaker 2:Well, this is not the only prediction, right? There's also this one here. This is from KU Leuven, actually, the university here. DTAI, that's the department there. So they also did a simulation of who was going to win. Actually, this has some really nice animations. Again, I didn't look into it as much as I would like. So I think this is for each group: they do a simulation for each group, and after that they actually go through the knockout phase. And again, in the simulation they have France winning.
Speaker 1:That's very impressive.
Speaker 2:Yeah right, they actually have here the different paths. They don't have Belgium and Italy, unfortunately. Actually, where is Italy? Is this Italy? Yeah, we will lose against England yeah, which does that make you feel better, or?
Speaker 1:it can happen, but so yeah, impressive visualizations yeah, the visualizations are really cool.
Speaker 2:They're really cool. Look at this, whoa. So we'll link it all on the show notes for people that are curious and wants to play around.
Speaker 1:But this is from KU Leuven.
Speaker 2:This is the KU Leuven DTAI Sports Analytics Lab. Here it's "About us", KU Leuven.
Speaker 1:Probably one of our colleagues, jonas, yeah, yeah.
Speaker 2:I think he.
Speaker 2:Yeah, yeah. So if you go here on the blog post, they also talk a bit about it. Again, I skimmed through it; I wish I had had more time to read it more carefully. So this is a bit the story, right? There are more visualizations on the previous page that I was showing, and here it's more about the explanation behind them. For the predictions, they actually take into account the Elo rating. Elo is something I hear a lot in the context of chess, right? So we have these matches that one person needs to win, and the idea is that if I beat someone that is above me in the ranking, I gain a lot of points and that person loses more. But if you're above me in the ranking and you beat me, you don't gain as many points, because it's expected that you'll beat me.
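The Elo idea described here can be sketched in a few lines. This is a minimal illustration of the standard chess-style Elo formula, not the exact method used in the KU Leuven post; the ratings, the K-factor of 20, and the function names are assumptions chosen for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo formula (0 to 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 20.0):
    """Return the new (rating_a, rating_b) after one match.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k (assumed to be 20 here) controls how fast ratings move.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical ratings: an underdog (1500) and a favourite (1700).
underdog, favourite = 1500.0, 1700.0

upset = update(underdog, favourite, score_a=1.0)     # the underdog wins
expected = update(favourite, underdog, score_a=1.0)  # the favourite wins

print(f"Upset: underdog gains {upset[0] - underdog:.1f} points")
print(f"Expected result: favourite gains {expected[0] - favourite:.1f} points")
```

With these made-up numbers, the underdog gains about 15 points for the upset, while the favourite gains only about 5 for the expected win, which is the asymmetry described above.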
Speaker 2:So they also take this into account for the latest matches. They also look into the individual offensive and defensive ratings, and the cumulative market value of the starting 11 of each country, based on Transfermarkt. So they have different stats here and then I think they combine them. I actually need to check, but they use three ratings that count equally, with some exceptions. And given the ratings of two teams, they use an ordered logistic regression model to estimate the distribution over win, draw, and loss between the two teams. Impressive. So it's a bit more statistical, I feel, right? I think they stick more to the facts, so I guess we'll have to see. So here we have France. Actually, you said France as well.
Speaker 1:I said France and you said...
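To make the ordered logistic regression step concrete, here is a rough sketch of how such a model turns a rating difference into win, draw, and loss probabilities. The coefficient `beta` and the two cut points are invented for illustration; in practice they would be fitted on historical matches, and the post's actual parameters are not given in the conversation.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def outcome_probabilities(
    rating_diff: float,
    beta: float = 0.005,           # hypothetical weight per rating point
    cut_loss_draw: float = -0.6,   # hypothetical threshold between loss and draw
    cut_draw_win: float = 0.6,     # hypothetical threshold between draw and win
) -> dict:
    """Ordered logit: P(outcome <= j) = sigmoid(cut_j - beta * rating_diff).

    rating_diff is (rating of team A) - (rating of team B);
    outcomes are ordered loss < draw < win from team A's point of view.
    """
    score = beta * rating_diff
    p_loss = sigmoid(cut_loss_draw - score)
    p_loss_or_draw = sigmoid(cut_draw_win - score)
    return {
        "loss": p_loss,
        "draw": p_loss_or_draw - p_loss,
        "win": 1.0 - p_loss_or_draw,
    }


for diff in (0, 100, 300):
    probs = outcome_probabilities(diff)
    print(diff, {k: round(v, 3) for k, v in probs.items()})
```

The three probabilities always sum to one, evenly matched teams get equal win and loss chances, and a larger rating gap shifts probability mass from loss and draw towards win.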
Speaker 2:You said you said England right, yeah, Okay, Brazil's going to win.
Speaker 1:What did the guy from Snowflake say?
Speaker 2:He said, um, yeah, England.
Speaker 3:Yeah, because he reran it like multiple times until he got England winning. Yeah, I think this is the one. But we are basically saying the same as the people from KU Leuven.
Speaker 2:Shout out to k11, um, but that's not the only way to make predictions, right, um, so you also brought this one up arguably the best way yeah, orangutan predicts scotland versus germany results at euro 2024. What is this about this one?
Speaker 3:I must admit I haven't read it yet so it's like, uh, it was a zoo, I think in dortmund, uh, and they put like this box for Germany and Scotland, like the scarves for the teams, and, to be honest, the just like took the German scarf and the other one put it away. So they say, okay, germany's gonna win, that's it. The weird thing is this always happens at these big tournaments that some kind of animal is predicting yeah, yeah, right, and it's never like really.
Speaker 3:Yeah, it's always a bit okay, but I think they said that he was gonna do some more predictions. Oh really, yeah, I think it's.
Speaker 2:I guess we'll see. I think in the which World Cup was it that there was an octopus? Yeah, 2010. 2010.
Speaker 3:It predicted Spain versus the Netherlands... What was the name of the octopus? I don't know. Paul, it was Paul.
Speaker 2:I think so, right, probably Paul, yeah. Okay, but I think I was back in Brazil, because octopus in Portuguese is polvo, and then if you say "polvo Paul", it's really cool. I think I really just remember that. In Italy it was the same.
Speaker 1:We say polpo Paul, polpo Paul.
Speaker 2:See how we say El Polpopol, el Polpopol.
Speaker 3:See how we say in the Netherlands, poud octopus. Well, not cool.
Speaker 2:The mood just went down. Everyone's happy laughing. Okay, and who? So he just predicted the one match?
Speaker 3:I guess this orangutan, yeah, apparently, but they said he was gonna do some more. He's gonna work; you need to earn your stay here at the zoo. But apparently he did do some other predictions, not only World Cup, but I think it was some Champions League matches that he predicted as well. So he has like a track record. Was it a good track record? I don't remember if he was correct or not. So he's studying?
Speaker 2:yeah, he's like when the zoo's closed. He's like watching the matches.
Speaker 3:But maybe there was another predict the scottish win. So maybe then we would see that one, that's sure.
Speaker 2:Maybe they just have enough animals predicting enough results, and then whichever one gets it right, they just pick that one. Isn't there an XKCD comic about that? Like, if you look at probabilities and you have enough sample groups, you're bound to find one correlation, something like that. Maybe I should read into it more before I bring it up, but anyway, something like that. But more on sports and technology, I have here "VAR, semi-automated offsides and connected ball technology: what will be used at Euro 2024?" What is this about, Vitale?
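The "enough animals predicting enough results" point can be put into numbers. The sketch below assumes every animal guesses each match like a fair coin flip and that the guesses are independent; the figures (100 animals, 8 matches) are made up for illustration.

```python
def p_at_least_one_perfect(n_animals: int, n_matches: int, p_correct: float = 0.5) -> float:
    """Chance that at least one of n_animals gets every one of n_matches right,
    assuming independent random guesses with probability p_correct each."""
    p_single_perfect = p_correct ** n_matches
    return 1.0 - (1.0 - p_single_perfect) ** n_animals


# Hypothetical scenario: 100 zoo animals each "predict" 8 matches at random.
print(f"One animal, 8 matches: {p_at_least_one_perfect(1, 8):.4f}")
print(f"100 animals, 8 matches: {p_at_least_one_perfect(100, 8):.4f}")
```

A single random guesser has well under a 1% chance of a perfect 8 out of 8, but with 100 guessers the chance that at least one of them looks like an oracle is roughly one in three. Only the lucky one makes the news.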
Speaker 1:Yes, so we were watching the last game of Belgium, and actually they scored two goals that were, how do you say, cancelled, yeah, by the technology, so VAR.
Speaker 1:And it was the first time that we saw a strange, let's say, plot during a replay in a match, and apparently this represents a sensor inside the ball that is able to capture all the touches that a player can make on the ball itself. So we are using more and more technology to make the game a bit more fair, but I'm wondering if it's not too much, because, let's say, even pro athletes are humans after all, right? So if they need to take into account all these tiny details that you can measure with advanced technology, how can they, let's say, perform?
Speaker 1:Because, for example, I will give a practical example. Let's say me and David, we are running. He's defending, I'm the striker. To take advantage of my speed, I need to be careful to start running at a specific moment in time, when I'm not too far beyond the offside line, because otherwise the VAR will identify it with, let's say, millimetric precision. So how can you think of all these kinds of things while you are playing football?
Speaker 1:But you mean as a player or as a referee? As a player. Even if you are running, right, you need to be careful not to touch the ball even with a finger, because if you touch it slightly it will be measured by the sensor and then it will basically be a foul. Yeah, you will basically run like this. Yeah, or you chop your arms off, exactly. So is all this technology really helpful? I don't know.
Speaker 3:I would like to know your opinion. If it's in your advantage, then yes, but else.
Speaker 1:It's true, but I think but I think it's like. It's not ruining the show. I don't know.
Speaker 2:Maybe not ruining the show, I don't know, but I feel like... Actually, I was looking for this. There we go, I'll share this tab instead. This is what you're talking about here, right? Exactly. For people that did not watch the match... Oh, I accept the cookies. So, for people that were not watching, maybe this video will show it. Maybe the VAR first. So what was it?
Speaker 2:VAR stands for virtual assisted referee or something. Okay, here we also see what the referee sees, right? So the idea is that there are people behind multiple screens, and they have all this information, and whenever it's a critical moment in the match, so if it's a goal or a red card or something, the people behind the screens can call the referee (the person in charge is still the referee on the field) to go to this little screen that you can see here on the right side, and then they can review the play from different angles. And indeed, as you were mentioning, when they were reviewing the play here, there was also this graph, which I thought was sound at first. I think it's a sensor in the ball. It's not sound? Yeah, probably a touch thing, like a movement sensor, because otherwise, if you just scream... Exactly, and the whole stadium is probably screaming, right? So what you see here, for people just listening, is basically almost a flat line, except for a peak in the middle, and then, after replaying it a lot of times, you can see that the peak happens exactly when the guy touches the ball with his hand. Which makes me think, again, that there's a movement sensor, like an accelerometer or whatever, to sense whenever there's contact. So the idea is that it's almost like you are the ball: you can feel the touch, and that happens exactly when the guy's hand is close to the ball. So it was a handball.
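The "flat line with one peak" described here is easy to reason about in code. The sketch below is only a guess at the general idea, not how the actual connected-ball system works: the sample values, the threshold, and the function name are all hypothetical. It looks for a single sample that stands out far above the usual noise level.

```python
from statistics import median


def find_touch(samples, threshold_ratio: float = 5.0):
    """Return the index of the largest spike if it clearly stands out
    from the baseline noise, otherwise None.

    The baseline is the median of the signal, and the noise level is the
    median absolute deviation from it; both are robust to a single large peak.
    """
    baseline = median(samples)
    deviations = [abs(s - baseline) for s in samples]
    noise = max(median(deviations), 1e-9)  # avoid a zero threshold on flat signals
    peak_index = max(range(len(samples)), key=deviations.__getitem__)
    if deviations[peak_index] > threshold_ratio * noise:
        return peak_index
    return None


# Hypothetical sensor readings: mostly flat, with one sharp peak at index 5.
signal = [0.02, 0.01, 0.03, 0.02, 0.01, 0.95, 0.40, 0.03, 0.02, 0.01]
print(find_touch(signal))
```

The returned index is the moment of the suspected touch, which a reviewer could then line up with the video frame, much like the replay described above.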
Speaker 2:The goal was disallowed, the game carried on, and Belgium lost. But we talked about that in the beginning. Sorry, David, but it's relevant. So I've never seen this.
Speaker 2:So my concern is not as much on the play, because I think this is just giving information. To be honest, I know it's the rule, but I think the rule should be a bit more flexible, because right now, if you look at this, the guy is running right, so no one runs without moving their arms. The guy behind of Slovakia, you can clearly see he's throwing his body. So it's very likely that his body kind of pushed the guy's arm which led him to also touch the ball. The ball didn't change the course that much. It wasn't like there wasn't a big difference, but there was a touch for me.
Speaker 2:I think the rule should be more flexible to allow the referee to say, no, yeah, it was a handball, but it didn't really interfere. Carry on exactly. But I think it's like I'm I'm not against more technology, because technology just gives more information. It doesn't necessarily constrain people to say, okay, now you have to do this, now you have to do that right. So I think it's more information is good, but I also think the decisions that are taken upon that information could be dealt a bit differently. That's how I feel.
Speaker 1:You are super wise, murilo, I think.
Speaker 2:Can I get another one? Let me Stop. No, I thought it's fine.
Speaker 1:No, can I get another? Let me stop. No, no, no, you're a smart guy. You need to send an email to FIFA or someone with the exact words you said.
Speaker 2:Then, yeah, the recording of this video. Indeed, indeed. But that's how I feel in general as well. In regards to the offside and all these things, I think it's good, but I do think that the rule should sometimes be a bit more... For this one, I think it should be more flexible. For the offside, actually, I'm quite okay with it.
Speaker 1:you think so. Like what if it's a centimeter?
Speaker 2:I think then it's more like you just have to draw the line because it needs to be binary right either you are offside or you're not, so so it's like the kind of but it's like kind of like tennis right Either the ball is in or it's out, and sometimes if it's one centimeter out, it's out.
Speaker 1:But there the line is fixed. Here is something dynamic, because sometimes it's your shoulders, sometimes your knees, sometimes your toes even.
Speaker 3:Yeah your toes? Was it Germany? Like the one who touched the ball, it was like really his toes or something. It was just offside, but they're planning on changing this rule. I think it's Wenger, like the old coach of Arsenal, who's like planning on changing this rule. What?
Speaker 2:function does he have?
Speaker 3:I think he's somewhere in FIFA. Oh really, yeah, but he's really proposing that you are one body width away from the other player. There's really space between you and the last defender. Then let's say so that there's a change in this rule, because you feel? Like your toe or your shoulder. Is it really a big difference? Yeah, like Lukaku was like one third of his body or something.
Speaker 2:Yeah, I remember it was uh, yeah, in brazil. So they're also very into it and I remember it was funny like they always have these, these technologies right, like they freeze the play and they do a 360 and there's this 3d simulation of the players and then they have like a invisible wall and you go see right through and there's like a little piece of the guy's like hair. You know he's like just across and it says offside. And then even in brazil they like zoomed in and they could see the cells of the guy and it's like, yeah, offside, yeah, no goal.
Speaker 2:So I thought it was, I know, just saying that's what that was my original point, like yeah, it's not sport anymore, I think no, I get it, I get it too, and but I think to me, I see your point, but I I also see why the offside, the offside rule is there. I do think that if the rule wasn't there, the game would be very different. It would be more boring, I guess, but it's like I think it's more. Yeah, I'm not necessarily against it just because I think we need to have this. We need to draw the line somewhere, and I know it's not perfect, but I'm not sure.
Speaker 3:But even I think it was also this year in the Premier League there were also very hefty discussions about the VAR, because they're not always that correct. I think they had a vote this year Like Wolves specifically wanted to get rid of it, so they had all the teams vote for it and they were the only ones who voted to remove it. So I think, also like the professional players are still a fan of it because it still is like something more fair, like there are a lot of games.
Speaker 3:If you look at the past, before we had the VAR, I think it was once England who lost the World Cup or something, or a big game, due to something that could have been prevented with VAR. So I think, in all fairness, it still is a very good tool, but the ones who are using it maybe need some more, like...
Speaker 2:Yeah, no, I see what you mean. Indeed, I think we're also focusing a lot on the situations in which it was wrong, but I still feel like there were way more errors before. Yeah, right, I think it's still a step. So, yeah, I see what you're saying, Vitale. Indeed, maybe we should adapt the rules, I think. Well, my other concern is also about the expenses, right? Because now there are chips in the ball, and I remember there was the goal-line technology chip, yeah, and then you also had to put sensors around the goal posts.
Speaker 2:But then I remember it was super expensive, like super expensive, and I can. I remember when I was a kid I used to buy the, the replica of the world cup ball, and it was already expensive. But now, with the chip, is like insanely expensive, right, and I think it's a nice idea. But then you're gonna have, okay, only the rich clubs, only, not even like first, all the first division clubs. Maybe they want to invest, because you're not. It's not just about buying one ball and all these things, right. So I also think a bit of that and I think there may be cheaper ways, like with cameras, to really have a good information on this.
Speaker 2:But you have to get going, I have to go. Thanks a lot. Thanks for joining us, Vitale. It's always a pleasure. How do you say go, Vitale? Forza Azzurri, forza Azzurri. It's against Spain tonight, right? Yeah, good luck, good luck, we'll see. We'll see how far you go, but David will stay with us here. But thanks, Vitale, for joining us today. All right, what else happened in these past days? I see here, David, that NVIDIA is a big deal.
Speaker 3:Yes, it is.
Speaker 2:What is it about?
Speaker 3:They recently I think like somewhere this week overtook Microsoft. I think they already previously overtook Apple as the biggest tech corporation in the world. Let's say Okay, which I think is a pretty big deal. Because I was thinking about it yesterday, I was like I don't think a lot of people outside of tech really know NVIDIA. Maybe, Like, if you compare it like Apple and Microsoft, yeah, yeah, yeah, it's way less known, but still they're the biggest tech corporation out there.
Speaker 2:Yeah, I think they win because they're leading the GPU game. Yeah, for sure. And I think, indeed, it's not as user-facing as an Apple or a Microsoft or anything. But yeah, if you think of the models and all the things that we need to do, you look at the compute, and, I mean, NVIDIA has been there for quite a lot of time, leading the GPU market and the developments and all these things. So it does make sense, especially now with the LLM world, the compute is getting more and more and more valuable, right?
Speaker 3:So yeah, specifically, I saw the comparison, um, because this is all in their stock valuation, uh, but when GPT released, like October, November 2022 or something, since then their stock has risen like 1,100%.
Speaker 2:Really.
Speaker 3:Yeah, you really see it skyrocketing since that release. Of course, there's a big part to play in that.
Speaker 2:I can imagine. I think we talked a bit about this a while ago, that there's still a company that produces chips, I think in Taiwan or something, and they produce all of them, not just for NVIDIA. But I remember it was crazy too that their profits are also skyrocketing, right, with all these things, which, yeah, is like the AI fever. Yeah, exactly, right, the LLM fever.
Speaker 2:I would say, yeah, probably. Um well, yeah, have you bought already your your shares?
Speaker 3:no, no, sadly, sadly. If I knew beforehand, maybe but are you surprised?
Speaker 2:did you not know, like, were you like whoa, who could have seen this coming?
Speaker 3:or were you like, yeah, okay? Probably you know it, but you have to act upon it before it really starts to go this far. So yeah, I didn't do that, sadly. Me neither.
Speaker 2:But okay, yeah, very cool. Good thing for NVIDIA. Let's see, and actually, so you mentioned NVIDIA had already overtaken Apple and now they overtook Microsoft, and now they're the MVC, the most valuable company. Yeah, you know, is there another player in that space, do you think?
Speaker 3:Yeah, uh, I saw as well that Intel and other big chip producers are starting to act upon this, and are also basically just going to copy what they're doing and sell it cheaper, because NVIDIA is selling at a very big profit margin, yeah, because right now they're the only ones really with these AI chips and so on. But I saw that Intel is basically going to do what they're doing, but really going to sell it for less, way less, to compete with them.
Speaker 2:So basically they're trying to gain market share, yeah, but just by making their prices lower. Yeah, at least in the beginning.
Speaker 3:And then, of course, other companies are trying to see, maybe they're not as good as what NVIDIA has, but if they're selling this so much cheaper, a lot of them are just going to try that out as well. Yeah, and I do think there are other...
Speaker 2:I mean, usually we talk about GPUs, right, but there are other chips specialized for AI. I mean, even Google has them: if you go on Google Colab notebooks, I think you can also have the TPUs, right? I think Microsoft is also releasing their own thing with AI-specific chips. I know Apple has a specific chip as well, with the Apple Intelligence thing. There's a lot of stuff going on there, right, but I still think that, I don't know enough technically to make comments here, but I imagine that NVIDIA has like years and years and years of experience on this.
Speaker 3:Yeah, of course it's, probably the biggest selling product, I guess.
Speaker 2:Indeed.
Speaker 3:But yeah, if you I mean AMD and Intel are for sure going to act upon this. Indeed, they don't have to reinvent the wheel, they just have to do a bit what they are doing.
Speaker 2:Yeah. True, true, true, true, true. Yeah, let's see. Let's see if it's going to be like this or we have other players popping up as well. Could be. And what else, what else? Databricks also had the AI summit. I'm not going to touch too much on this because the expert unfortunately had to leave. Um, but one of the things that I've seen, and I also saw this passing on our Slack channels, is that they are now open sourcing
Speaker 2:Unity Catalog, right? Apparently, let's see here, actually, on the, should you step instead on the? Apparently it was very dramatic. I was watching a video, so they actually open sourced it live. The guy who was walking on stage, he went on his laptop, he shared the screen and then he went on the settings and, like, make this repo public, and I was ready. So, uh, Unity Catalog, I think the idea is that it's a catalog for data models. It's kind of like an all-in-one kind of thing, right, and I think this follows the... So, yeah, you see here some more nice images, how it's like your intermediate layer between the platforms and services and the actual underlying data and all these things, right. So it helps you organize and catalog everything. Databricks is also known to open source quite a lot of stuff, right? Like Spark is open source. They also did a lot of models. They did quite a lot of stuff.
Speaker 3:Yeah.
Speaker 2:One thing that I thought was interesting from this whole ordeal... oh, by the way, this is Unity Catalog here. Most of it is Java, which was surprising, but I could not let this pass by unnoticed: they also have Catalog RS. They also have something in Rust. That's right, everybody's doing it. Otherwise you're not cool if you don't do Rust. But why I thought it was interesting: I also saw that Polaris, which is, let's say, a competitor of Unity Catalog, this is from SnowflakeDB, they also open-sourced it, and I'm using some air quotes here, right, because if you look at the actual repo, this was open-sourced two weeks ago but there's nothing there. So you see here that Polaris Catalog will be open-sourced and Apache 2.0 licensed in the next 90 days, right? So in the meantime, watch this repo, blah, blah, blah, and follow the announcement blog post. So there's also more information here. But I thought it was very curious. They weren't ready for it.
Speaker 3:Yeah, Databricks was like let's go.
Speaker 2:Yeah, they. It feels like one of those ai moves. You know it feels also very forced now.
Speaker 3:Databricks was probably like, get it, we're gonna do this, and Snowflake was...
Speaker 2:Yeah, it's like, I don't want to do that. Actually it's super strange, it feels like those half-baked AI products, you know, that people ship and then they try to work on it after it's been shipped. So it's like, yeah, we open sourced it, but did you really, you know? Um, this is a bit the blog post, right, about Polaris, so there is a bit more information here on how it works and how you can write stuff. I'm not super into this world, let's say, but I have heard chatter, and now apparently it's going to be Unity Catalog and Delta Lake, so Delta Lake is the table format underneath, versus Polaris and Iceberg.
Speaker 3:Yeah, um, maybe are you into this follow the bits but they're not like really in-depth like our previous uh hosted yeah, yeah, maybe I want to also share this real quick.
Speaker 2:This is a bit of the announcement video. So they also took a jab, because Databricks announced they're going to open source, and I think then Snowflake open sourced it with the empty repo, and then in the keynote from Databricks, they actually... is this for real, maybe 90 days, right? So they actually not that subtly call them out, right. And this is a video in which, yeah, the guy kind of just walks to his laptop very casually and he really just kind of open sourced this live. So it's pretty cool. It's pretty cool, you see here. Do you want to make this repo public? Yes, I want to make this public.
Speaker 2:I understand, thanks. So I'm not as in the game, as I mentioned, but one thing that is interesting for me is Snowflake and the Iceberg tables, the Iceberg format. This is also something that came up from a colleague, who shared a LinkedIn post: that Iceberg feels like an unstoppable train that is going to make data stacks even more fun. "I had to try dbt dynamically running models on different SQL engines." All in all, what he's saying is that if you look at this image, it's a good starting point. You have dbt, and again we talked about dbt a few times, which basically orchestrates some queries to transform your data, kind of bringing software engineering best practices to data analytics, these things. You have Snowflake which, as we mentioned, is like a warehouse, it's a query engine, and then you can also have DuckDB, which is like in-memory, right. So it's optimized for one node, and the idea is that they both can actually read and write. Well, Snowflake can read and write, and DuckDB can read from Iceberg.
Speaker 2:Why is this interesting in my case is because I have encountered a need as we were moving towards Snowflake. A lot of the data was still on S3, on blob storage, and one way we could bypass having to migrate everything to Snowflake first is to say let's make whatever is available on S3 as an external table in Snowflake. For people that are not as much understanding what I'm trying to say here so Snowflake has storage, has compute, right. It's optimized to run queries. So the compute to run on top of the data that is on Snowflake. But Snowflake also makes it possible for you to run queries.
Speaker 2:So use the compute with storage that is not in Snowflake, right? That's what the concept of external tables on Snowflake is and the tables that we had, the data that we had was in Parquet format, which is a file format, and apparently it was much, much, much slower. So we had this quick fix of saying, okay, the data is not on Snowflake yet, but let's just make the data that is on S3 available in Snowflake and just query it that way, and then, whenever the data does become available in Snowflake, we'll just have faster queries. But it's the same thing for the people that are actually using Snowflake. Yeah, but the performance is really bad and apparently there was some investigations to use Iceberg instead of Parquet and they said that the performance is almost the same as native tables. Yeah, pretty cool right.
Speaker 3:That's pretty impressive.
Speaker 2:Yeah, and I think the other thing that is cool is because for us, the idea is to use Snowflake as a data source for machine learning models. So the first step is to specify what kind of data you want from Snowflake, let's say but even if the data is not all available in Snowflake, let's say that for some reason, some data is not going to go there, it's going to stay on S3, we can still blend that data and run queries that will join data from S3 and Snowflake, because everything is accessible via the Snowflake query engine and because Iceberg has very little. Well, I think we still need to dive deeper as a team, right, but it looks like there's very little performance degradation compared to native tables. It seems like a very, very, very good way, very flexible, to move forward, right.
Speaker 2:And the other thing, too, in this project, this trial run, that Boria or Borja, I don't know how to pronounce his name, is also saying, is that the idea is that you can have Snowflake for the large queries, let's say, but you can also have DuckDB on that same data or a subset of data, right? Because, again, it's all that. So if you have something that is small, you can just run DuckDB. You don't need to use the whole compute of Snowflake and all the things and pay more potentially. You can just use DuckDB there, but it's actually running stuff with.
Speaker 3:Yeah, so you just use Snowflake then for their query optimization and so on and then use DuckDB or some other maybe that supports the Iceberg just for their regular okay, for tests. You don't really need query optimization with Snowflake. You can just use DuckDB in this case, which is really nice that they then both support it, but I think there are a lot of others who also support it, right.
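The split described here (Snowflake for large queries, DuckDB for small ones over the same Iceberg tables) could be sketched as a tiny routing helper. This is only an illustration under stated assumptions, not code from the POC mentioned in the episode: the `TableInfo` type, the `pick_engine` function, and the 10 GB cutoff are all hypothetical, and no real Snowflake or DuckDB calls are made.

```python
from dataclasses import dataclass


@dataclass
class TableInfo:
    """Hypothetical description of an Iceberg table a query will touch."""
    name: str
    size_gb: float  # assumed estimate of the table's data size


# Hypothetical cutoff: below this, a single-node engine is assumed to be enough.
SINGLE_NODE_LIMIT_GB = 10.0


def pick_engine(tables, limit_gb=SINGLE_NODE_LIMIT_GB):
    """Return "duckdb" for small workloads and "snowflake" for large ones."""
    total_gb = sum(table.size_gb for table in tables)
    return "duckdb" if total_gb <= limit_gb else "snowflake"


small_job = [TableInfo("orders_sample", 0.5)]
large_job = [TableInfo("orders", 800.0), TableInfo("customers", 2.0)]

print(pick_engine(small_job))
print(pick_engine(large_job))
```

The point of the design is that the data stays in one open table format, so the choice of engine becomes a cost and convenience decision rather than a migration.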
Speaker 2:Yeah, yeah, yeah, yeah. So I think I saw, so I don't think I put it here, but I also saw that there was a comment on how much Iceberg has been adopted by the big query engines and BigQuery and all these things. It's still work in progress, but it looks cool. So again, I'm not an expert on these things, but it has come into my life. Yeah, yeah, yeah. Was it not released as open source as well, Iceberg? I think it is.
Speaker 2:I think this is Apache, it's an Apache thing.
Speaker 3:So I think, it's there. We can look at it. Don't have to use expensive query engines all the time Indeed.
Speaker 2:Indeed, indeed. So yeah, there you go. So if you go here, share this table, it's here, it's open source, written in Java as well. It's not Rust, but yeah. But this is cool and also for the people that are interested in this POC that I just showed on LinkedIn, this is the link so you kind of see a bit the setup, so dbticebergpoc, so it's a bit the working code, so you can actually try it for yourself. I thought it was interesting.
Speaker 3:Yeah, for sure. And I hope it delivers its promises as well.
Speaker 2:Yeah, really something to look into. Indeed huh, indeed, indeed, indeed, indeed. So I was really happy. Well, I'll come back when I once we have, yeah, done the whole exercise and see if it really delivers.
Speaker 2:It almost looks nice but, uh, it does look nice, right, it does look nice. Finally, you know, what else? What else do we have from GitHub? Maybe, because you know what Bart usually says. If he was here he would definitely say that a library a week keeps the mind at peak. We don't have any soundbites for that yet. Yet, yet. Next post.
Speaker 3:Yeah.
Speaker 2:So for this week, this came across again on our Slack channels: BM25S, an ultra-fast implementation of BM25 in pure Python, powered by SciPy sparse matrices. Does any of this mean anything to you?
Speaker 3:No, no, somebody see?
Speaker 2:this in the ghost? Yeah, I don't think they do a very good job in the names, even the description. Welcome to the BM25S Python package. BM25 is an algorithm for ranking documents based on a query, for text retrieval tasks, a core component of search services like Elasticsearch. So, all in all, TL;DR as I understand it is... do you know what TF-IDF is? TF-IDF? I think that's a no. Yeah, it stands for term frequency, inverse document frequency, but basically I came across it in the context of machine learning.
Speaker 2:So maybe I should put TF-IDF here. I came across it within the context of machine learning. The idea is, if you have sentences, well, this is like NLP, natural language processing, way before LLMs, right, but the idea is that you have sentences and the way that you can go from a sentence to numbers is to kind of classify words. So if a word happens, how often does this word appear? Yeah, right, so that's where the term frequency comes in. And then inverse document frequency is basically how much this word appears in this sentence compared to all the other documents, all the other sentences that you're possibly looking at. So, for example, stop words, like "and" or "but", they appear a lot in the sentence you're looking at, but they're also going to appear a lot in all the other sentences. Yeah, right, so it kind of balances it out. So actually TF-IDF is like a ratio, so it's like the term frequency, but, okay. And I think this is actually good enough a lot of the times. For example, if I give you a document and it's a legal document, this is probably enough as machine learning information, right, because the jargon that's going to appear there is probably going to be very specific. But if I tell you a story and I say, who's the main character of this story, this is probably not good enough, right? So in one you're just kind of taking a helicopter view and classifying documents, and in the other one you actually need to understand what's written there. So for some use cases this is probably enough. For other use cases this is not.
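As an illustration of the TF-IDF idea described above, here is a minimal Python sketch. It assumes the simplest textbook variant (term frequency as a share of the document's words, inverse document frequency as a plain logarithm, whitespace tokenization); real libraries use smoothed variants, and the `tf_idf` helper and the sample sentences are made up for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (textbook TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # term frequency (share of this document) times inverse document frequency
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "the contract and the lease",
    "the dragon and the knight",
    "the lease and the tenant",
]
weights = tf_idf(docs)
print(weights[0])
```

Words such as "the" and "and" occur in every sentence, so their inverse document frequency is log(1) = 0 and they drop out, while a word that is specific to one sentence, such as "contract", keeps a positive weight.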
Speaker 2:And now going back to BM25S, it's something kind of analogous to that. I think it's a different type of algorithm. It's not TF-IDF, but it's also to kind of tokenize words. So from sentences, there are a few little examples here, you can tokenize these things, you can say what kind of stop words you have, so you can also remove them, and you can basically do this very, very fast. So it's a new implementation. This is not machine learning at all. It's just scientific Python libraries. It just speeds things up, yeah, but apparently by a lot, so BM25S is even much faster than the stuff from Elasticsearch. But I also think this is optimized for one machine, right, so there's probably always going to be a ceiling. If you have too much data, it doesn't matter, so it's a bit of that. But I'm always excited to see... You haven't used it? No, I have not used this. I have not used this, but sometimes you feel the pressure to bring a package a day, a week, you know, so...
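For reference, the BM25 ranking function mentioned here can be sketched in a few lines of plain Python. This is a minimal illustration of the textbook Okapi BM25 formula (with the common `k1` and `b` parameters and a Lucene-style smoothed IDF), assuming whitespace tokenization. It is not how the BM25S package is implemented, and `bm25_scores` and the sample sentences are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with textbook Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in counts:
                continue
            # Smoothed IDF: rare terms weigh more, common terms close to zero.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            tf = counts[term]
            # Term frequency saturates (k1) and is normalized by document length (b).
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "a cat is a feline and likes to purr",
    "a dog is the best friend of a human and loves to play",
    "a bird is a beautiful animal that can fly",
]
scores = bm25_scores("does the fish purr like a cat", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, docs[best])
```

Compared with plain TF-IDF, the two visible differences are that repeated occurrences of a word count for less and less, and that long documents are not favored just for being long.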
Speaker 2:I was like, I'll bring this one. And also they have a way to load these things to Hugging Face. Easy, so yeah, you have a Hugging Face Hub token and then you can very easily push it there, so I thought it was cool. I also think it's cool that not everything... I mean, this is just Python. It's just Python, it's just using the scientific libraries and stuff. Thought I would share, thought I would share. For sure.
Speaker 2:And I think this is it. No, Do you have anything else? Cha-cha? No, I think this is it.
Speaker 3:This is it indeed alright cool, very nice podcast.
Speaker 2:Yeah, because you're here, yeah of course, I think this is the best episode ever actually yeah, any any plans for the weekend actually? For the weekend, yeah, or for the following days the Euros, just Euros. What is the Belgium's next match, saturday?
Speaker 2:Saturday 9 against Ukraine if you lose that, then I won't watch anymore if you lose that one, then they cannot pass right, they're not gonna if you lose that, then it's probably gonna be over yeah yeah, because you have like four teams per group. Yeah, so you play three matches. Yeah, so you lose two of the three. Yeah.
Speaker 3:Probably not, unless if you win the last one and then maybe progress in the best thirds. But then you have to rely on the other ones.
Speaker 2:Yeah, it's a bit different. Yeah, we should win. Actually, I think I even saw here there was a food for thought moment. It was like Argentina lost their first group stage game. They became Did you put that? Yeah?
Speaker 3:Just to give myself some hope.
Speaker 2:Yeah, yeah. So Argentina lost their first group stage game and they became world champion in the last World Cup. Belgium loses their first group stage game, right, they're going to become European champion. But I feel like I don't believe in coincidence. It's all about the data. The patterns are clearly there. Uh, yeah, let's see, let's see. But I still feel like the Belgium team was better a few years ago. Yeah, for sure, the peak of Belgian talent was a few years ago. Yeah, we lost Hazard, now it's really sad.
Speaker 3:Yeah, like, yeah, we'll see. I think we do have like a good youth coming up, but they still yeah, some of them are data engineers. Football yeah, exactly yeah, but I think I think we have a good youth coming up. Uh, they still need some experience more than but who who's the rising Belgium? Star, I would say Doku, probably. Doku is really up there. I think I heard First season for man City was already very good. How old is he? 22, something Not really sure exactly, but very young.
Speaker 2:Okay, yeah, Brazil also has some rising stars: Rodrygo, Vinícius. Vinícius, yes. I think I would say they're rising. Are you a
Speaker 3:fan of Vinícius Júnior? Uh, I think so.
Speaker 2:Yeah, he's a little hated, but I think it's normal. In Brazil we say that they are marrento, which is like, um, how can I describe it? It's like a bit cocky, a bit like, you know, he's not trying to be friends, he's not trying to be agreeable, you know, he's a bit like, really, uh, his personality.
Speaker 2:Let's say, yeah, yeah, yeah. But I think he's different from other Brazilians. Like Neymar, he was also like that, but I still feel like Vinícius, I don't know, I have, I don't know.
Speaker 2:Let's just leave it at that they're different, they're different people. But yeah, yeah, let's see, let's see. Actually, fun fact to wrap up the episode I was in the World Cup of 2018. I went to Russia. Yeah, for real, yeah, yeah, yeah, I was there. I applied, I was living in the US. I applied for the lottery tickets and I said, okay, if I'm going to apply, I'm going to apply for like three matches already I'm not going to go there for one match. So I applied for three. My friend applied for one. I got all three. He didn't get his. Yeah, so I was like, okay, I guess I'm going to go. So I went. I watched the last game of the group stage in the first of the knockout. Only the Brazilian games, you mean, or huh?
Speaker 3:only the Brazilian games.
Speaker 2:Yeah, yeah so like in FIFA, they have different packages. You can like have a follow your country, so if they win you just redeem the ticket, so you already pay, but if they lose you get your money back. So that's kind of what I did. But I was pretty confident Brazil was gonna get through the. But so I watched the last two of the group stage and the first of the knockout and it was against Mexico and we beat Mexico and I was so pumped and I was like I'm going to watch the next one, I don't care.
Speaker 2:I had already flights because I was going to come to Europe actually. So I was in the US, I went to Russia and I was going to do some backpacking in Europe. I purposely missed that flight because I took like an 18-hour bus, because it was like very far from Moscow. I slept there. I didn't have a hostel, I just went there. I was like I'm just going to go watch the game, come back. I got to the city it was Kazan. I ran into a guy that I bumped into in the matches before so I said he didn't speak English. So I was like, okay, I'll help you. I went to his hostel. I took a shower there because I was 18 hours on the bus. I just walked around ate. I went to the match. We lost. I was super depressed. I was like I was very happy.
Speaker 2:Super, and it was such a small town that the buses would stop after a while. And the train station was like so I had a train coming back because FIFA, they have the FIFA trains. And then I had to like someone need to give me a ride. A Russian guy that he was going to a party. And then he said, like I'll take you to the station, but I'm going there now. And I was like, okay, let's go. It was such a mess. And then I just got to like it was a party, like I didn't want to party. You know, we just lost. I was just like man, I don't want to watch. I got the train at 5 am. It was like a 10-hour train as well. I got back. It was very sad.
Speaker 2:And then I moved to Belgium to study. So it was very, yeah, better country. Yeah, for the first I was like, oh, where are you from? From Brazil? I was like, oh, brazil, yeah, yeah, yeah, I get it, I get it. So, still a bit salty. Now I'm here. Now I'm here. It's nice. David, thanks a lot. Thanks for having me. Yeah, it was a pleasure. Let's see. Go Belgium. How do you say go Belgium in?
Speaker 3:uh, uh, hip, hip Belgium I don't know I know how we say it like we said go Belgium, go Belgium, belgium, yeah, go Belgium okay maybe just just stop okay, thanks yo you have taste in a way that's meaningful to software people hello, I'm bill gates.
Speaker 2:I would. I would recommend uh typescript. Yeah, it writes a lot of code for me and usually it's's slightly wrong.
Speaker 3:I'm reminded it's a bust, kid Rust.
Speaker 1:This almost makes me happy that I didn't become a supermodel.
Speaker 3:Cooper and Netties.
Speaker 1:Boy. I'm sorry guys, I don't know what's going on.
Speaker 3:Thank you for the opportunity to speak to you today about large neural networks. It's really an honor to be here, rust.
Speaker 1:Data topics. Welcome to the Data Topics Podcast.