DataTopics Unplugged

#42 Unraveling the Fabric of Data: Microsoft's Ecosystem and Beyond

March 25, 2024

Welcome to the cozy corner of the tech world where ones and zeros mingle with casual chit-chat. Datatopics Unplugged is your go-to spot for relaxed discussions around tech, news, data, and society.

Dive into conversations that should flow as smoothly as your morning coffee (but don't), where industry insights meet laid-back banter. Whether you're a data aficionado or just someone curious about the digital age, pull up a chair, relax, and let's get into the heart of data, unplugged style!

In this episode #42, titled "Unraveling the Fabric of Data: Microsoft's Ecosystem and Beyond," we're joined once again by the tech maestro and newly minted Microsoft MVP, Sam Debruyn. Sam brings to the table a bevy of updates from his recent accolades to the intricacies of Microsoft's data platforms and the world of SQL.

  • Biz Buzz: From Reddit's IPO to the performance versus utility debate in database selection, we dissect the big moves shaking up the business side of tech. Read about Reddit's IPO.
  • Microsoft's Fabric Unraveled: Get the lowdown on Microsoft's Fabric, the one-stop AI platform, as Sam Debruyn gives us a deep dive into its capabilities and integration with Azure Databricks and Power BI. Discover more about Fabric and dive into Sam's blog.
  • dbt Developments: Sam talks dbt and the exciting new SQL tool for data pipeline building with upcoming unit testing capabilities.
  • Polaris Project: Delving into Microsoft's internal storage projects, including insights on Polaris and its integration with Synapse SQL. Read the paper here.
  • AI Advances: From the release of Grok-1 and Apple's MM1 AI model to GPT-4's trillion parameters, we discuss the leaps in artificial intelligence.
  • Stability in Motion: After OpenAI's Sora, we look at Stability AI's new venture into motion with Stable Video. Check out Stable Video.
  • Benchmarking Debate: A critical look at performance benchmarks in database selection and the ongoing search for the 'best' database. Contemplate benchmarking perspectives.
  • Versioning Philosophy: Hot takes on semantic versioning and what stability really means in software development. Dive into Semantic Versioning.
Speaker 1:

You have taste in a way that's meaningful to software people.

Speaker 2:

Hello, I'm Bill Gates. I would recommend TypeScript. Yeah, it writes a lot of code for me and usually it's slightly wrong. I'm reminded into that the rust here, rust Congressman. I've been made by a different company and so you know you will not learn rust while you're trying to read it.

Speaker 1:

Well, I'm sorry, guys, I don't know what's going on.

Speaker 3:

Thank you for the opportunity to speak to you today about large neural networks. It's really an honor to be here.

Speaker 1:

Rust Data topics. Welcome to the Data Topics Pub.

Speaker 3:

Hello and welcome to Data Topics Unplugged, the casual, light-hearted, cozy corner of the web where we discuss what's new in data, from AI models to Fabric to, apparently, Sam as a unit of measurement, right, Bart? Anything goes. Today is the 22nd of March 2024. My name is Murilo, I'll be hosting you today, and I'm joined by Bart and Sam. Hey, Sam. Hello. Sam is a friend of the pod. He's been here before; he's the second repeat guest. Bart Spallot was the first one.

Speaker 2:

Probably, probably the first one. But Sam is our unit of measurement. Maybe you need to explain that for the people that were not here.

Speaker 3:

And for people that are also not following us on the live stream: we are on a live stream, by the way. We're on YouTube, LinkedIn, X, Twitch. Is that not right, Bart?

Speaker 2:

LinkedIn LinkedIn.

Speaker 3:

Well, I think it's LinkedIn, and Twitch is the only one that maybe has been neglected, but we are there. And for the people that are following us on the live stream, you can probably see, and we were commenting on this before we started recording, that Sam looks really tall even seated on the couch.

Speaker 2:

Yeah, we're using a special fisheye lens, just to get him on the screen.

Speaker 3:

Exactly. Yeah, it's like a negative zoom kind of thing. So yeah, it's not that I'm short, it's just that Sam is tall, for the people that are watching. You know, it can also be both. How tall are you, Sam?

Speaker 1:

1 meter 96. 196, yes. I think you're the tallest. Well, there was one guest that was exactly the size of one Sam.

Speaker 3:

Yeah, I remember.

Speaker 2:

How many Sam's are you?

Speaker 3:

In length? Too many. I don't want to... How many Sams are you? You don't want to say it publicly? I can calculate it quickly. So, Bart, for the people that are just joining: Bart today woke up seeking blood.

Speaker 2:

My length is 0.9 Sams.

Speaker 3:

You're 0.9 Sams. Oh, that's pretty tall actually. How many Sams am I then?

Speaker 2:

I think I'm just gonna go by guesstimate here: you're 0.89. So still, okay. Sounds high.

Speaker 3:

Yeah, it sounds pretty high. So you heard it here first: I'm tall. But yeah, Sam, I think the last time we talked was... do you remember, Bart? It was last year for sure.

Speaker 2:

I would say September, october last year.

Speaker 1:

Already okay, yeah, yeah.

Speaker 3:

But any life updates, any exciting things in your life?

Speaker 1:

Well, yeah, sure. Over the last six months? Over the last... yeah, yeah, yeah.

Speaker 3:

Any exciting updates.

Speaker 1:

No, nothing special no.

Speaker 3:

Are you?

Speaker 2:

sure? Nothing has happened?

Speaker 3:

No? I don't know... titles, prizes, roles?

Speaker 1:

No, nothing. Ah, that's what you're looking for. So, in basketball, how do they call this?

Speaker 2:

Very valuable players.

Speaker 3:

Yeah, it's like... very valuable. No, the most valuable player.

Speaker 1:

Yeah, but for yeah, I didn't think about it. For me, it's normal.

Speaker 3:

I was expecting it. It was an award, it was just recognition.

Speaker 1:

That's not it. So I got the Microsoft MVP Award in.

Speaker 3:

January Congrats, congrats. Can we get the clap? The clap? Yes, go Sam.

Speaker 1:

But yeah, so first you get nominated, and then there's a whole application procedure, and this takes quite some time. I remember that I had to fill in lots of questions while I was on summer holiday. So in my head it's not something I did over the last six months. But yeah, it was just a Tuesday for Sam.

Speaker 3:

Yeah, woke up, became an MVP. It's fine, it's fine. But not only that; there was more than that, right?

Speaker 1:

Yeah. I also received a dbt Community Award, I think it was. They have a program that they've always had in dbt, the Community Spotlight, which I was in, and this year they also gave awards to people who contributed to the community significantly. Since I organized the Belgium dbt meetup and made a few pull requests, I received an award this year.

Speaker 3:

Really cool, really cool. And you're also a core contributor to dbt, right?

Speaker 1:

Yeah, last time I checked I was like number 35 on dbt Core, and still number one or two on the dbt adapters for Microsoft: Fabric, SQL Server, Azure SQL, Azure Synapse. I did lots of different things.

Speaker 3:

Cool. In fact, I think dbt-fabric, you actually started it, right? And now it's on Microsoft's side.

Speaker 1:

I didn't start it, no. I hope I don't forget anyone or mispronounce names. A guy called Jacob, I don't remember his last name, it started with M because his GitHub username was JacobM or something, and then someone named Mikael Ene took it over; I think he's from Denmark or Sweden. Then the next person to maintain it was Andrew Swanson, who then joined dbt Labs as an employee and didn't have any time anymore to work on it, since he does that same kind of work now for all the adapters. And yeah, I was still using the adapter and needed new changes and updates to it, so I took it over then.

Speaker 3:

Cool, really cool, really cool news. It's impressive.

Speaker 2:

I've been wondering, like when I hear all these achievements, I'm thinking to myself like what have I been doing with my life for the last six months?

Speaker 3:

Hiking to Kilos is like, is that?

Speaker 1:

It's not something I continuously do. I didn't contribute over the last six months, and someone else, I think Ty was his name, did all the work to get it to dbt 1.7 or something, and then our colleagues here at Dataroots even did the work in dbt-fabric to bump it to later versions.

Speaker 3:

Really cool. And we'll be talking more about dbt. What is dbt? The dialectical behavior therapy thing?

Speaker 1:

Yeah, exactly, dialectical behavior therapy, something from the psychology world. That's what you get when you Google it. And if people write it in caps, DBT, my eyes start twitching; it's spelled lowercase. dbt is a tool to transform your data: they built a tool that you can use to transform your data using SQL, and it helps you apply the practices that we know from software engineering to data transformations in SQL. It's compatible with almost any database, and if it's not compatible yet, then someone will make an adapter for it, like I contributed there. And there's lots of things to talk about with dbt: they have a cloud version, a core version and so on.
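To make Sam's description concrete, here is a toy sketch of dbt's core trick. This is not dbt's actual implementation, and the model and table names are invented, but it shows the idea: a model is a SQL template whose `ref()` calls are compiled into physical table names before the query is sent to the database.

```python
import re

# A dbt-style model: plain SQL plus a ref() macro pointing at another model.
# Model and table names below are made up for illustration.
model_sql = """
select customer_id, count(*) as n_orders
from {{ ref('stg_orders') }}
group by customer_id
"""

# Mapping from model names to the physical relations they materialize as.
relations = {"stg_orders": "analytics.stg_orders"}

def resolve_refs(sql: str, relations: dict) -> str:
    """Replace each {{ ref('name') }} with the physical relation name."""
    pattern = r"\{\{\s*ref\('([^']+)'\)\s*\}\}"
    return re.sub(pattern, lambda m: relations[m.group(1)], sql)

compiled = resolve_refs(model_sql, relations)
print(compiled)  # plain SQL any database can execute
```

In real dbt, `ref()` additionally records an edge in the dependency graph between models, which is what lets dbt run all transformations in the right order.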

Speaker 3:

Yeah, really cool. It's a really cool piece of tech, and I feel like it's gaining a lot of traction in the past years. Also, if whoever is listening or watching wants to check out Sam's work, there is his blog; he has a lot of stuff on Fabric and dbt as well. And the story with Fabric doesn't stop here, right? For the Microsoft MVP: is there something related to Fabric in the MVP title, or is it just general?

Speaker 1:

Microsoft MVP, so you get nominated and awarded in a certain category, and for me it's Data Platforms. Then within that category you can be assigned multiple technologies, one or two I think, and for me it's Microsoft Fabric, and then Tools and Connectivity; that's the link with dbt there. It's quite new, so there are not many other people who have the Fabric one yet, but a lot of MVPs who renew will probably also be doing more with Fabric.

Speaker 3:

So do you have to renew the MVP title as well?

Speaker 1:

Yeah, every year. It's an award for one year, and then you might get renewed or not. If you look at my recent content, there is some work to be done for next year.

Speaker 3:

I wasn't going to bring it up, but since you did, I don't know.

Speaker 1:

During the winter I usually don't do a lot of community stuff, contributions, blog posts. This is something I like to do in the summer. Interesting, you need sunlight for this? Yeah, I like to sit outside, maybe play a bit with technology, write some blog posts or go running, and then in my head I've already started to prepare the posts.

Speaker 3:

Bart's like oh, when I run I'm just listening to music.

Speaker 2:

I actually listen to podcasts a lot. I think running is indeed good for the creative thought process as well.

Speaker 3:

I heard that too. I feel like I need to be careful what I say around Bart, as everyone will know by the end of this podcast episode. But one thing I heard is that running, doing these activities where you don't have to focus too much on the activity itself, is good for learning, for creativity. The analogy they used is that the brain is like the pins on a pinball machine and the thoughts are like the ball: when the pins are really close together, you keep activating the same areas of your brain, but when you're running it's almost like the pins spread apart, so you make connections that maybe you wouldn't make when you're just focused on the task.

Speaker 3:

So they were advising that, to learn effectively, you should alternate between deep focus and these more passive modes, like running or doing this or that. I also know a personal story from someone who said: I had this bug and I couldn't solve it, I spent like three hours on it, and I went to play football that night, and then in the middle of the game I was like, oh, I know what's wrong.

Speaker 3:

I just left and solved the bug. So maybe there is some truth to that. And so we touched a bit on Fabric; we're going to dive in a bit deeper.

Speaker 2:

Maybe before we dive in, for the people that are listening to us: I'm pointing upwards to the wall. What is above me?

Speaker 3:

It's a duck, hand painted. Yeah, but who is the artist? Actually, it's Alex. Yes. Can we get a round of applause for Alex as well? Yeah, Alex actually hand painted this, you know, as part of her responsibilities at Dataroots; Bart assigned her to this task. That's not true, that's not true, this is fully on Alex.

Speaker 2:

She came up with this idea. It looks really cool.

Speaker 3:

It looks really cool. Thanks, Alex, indeed, indeed. Yeah, Alex is also responsible for the design.

Speaker 2:

Is a duck now officially our mascot.

Speaker 3:

I think so. Bart is the one that picked the duck, he went with it, he did the logos, he did everything. He's like, oh, it's the de facto mascot. I was like, well, Bart...

Speaker 2:

We never discussed whether it was the mascot, right? From now on, it is the mascot.

Speaker 3:

That's Bart's strategy for everything: he does stuff, and then after two years it's like, oh, it was never official, I wasn't trying to push anything.

Speaker 2:

It's official now, right? Yes, yes. And does the mascot have a name?

Speaker 3:

What is the name, Bart? I'll let you decide. Quackers? Quackers. Okay, Quackers it is. I like it.

Speaker 2:

Quackers McFluff. I think he needs a bit more body to it, quackers McFluff.

Speaker 3:

So you went with a duck, right? A duck as a mascot. You were the one that initiated it. Is there a backstory to it?

Speaker 2:

Really, I just often think about ducks.

Speaker 3:

Really.

Speaker 2:

I think that's Not consciously.

Speaker 3:

I feel like there's a lot to unpack there. I'm not sure if I have enough time, so maybe I'll leave it at that.

Speaker 2:

I think that's something for another episode, yeah.

Speaker 3:

I think for another episode.

Speaker 1:

There's a lot of data tooling that's related to ducks. We have DuckDB, obviously. And you see, there are Prefect ducks, and they call him Marvin.

Speaker 3:

Oh really? Why Marvin, do you know?

Speaker 1:

No, that I don't know.

Speaker 3:

Maybe you want to show it for the camera. I think Quackers McFluff is a better name. Yeah, yeah, there we go. And I think for Prefect, the company itself has nothing to do with ducks, the logo, but I think the idea with the duck is like the developer's debug duck. The debug duck. What is that, Bart?

Speaker 2:

I think it's better for Sam to explain; it's been a long time for me. It's like a companion for debugging, right?

Speaker 3:

But I usually don't have bugs, so I don't know what it's for. I don't know how you all deal with that; I just don't put them in in the first place. What is it?

Speaker 2:

So I'll explain it; I have the text in front of me. It's called rubber duck debugging. It's a method of debugging code by articulating a problem in spoken or written natural language. It references a story in a programming book where a programmer would carry around a rubber duck and debug their code by forcing themselves to explain it, line by line, to the duck.

Speaker 1:

Did you try?

Speaker 2:

No.

Speaker 1:

I learned to code by myself, but on my first job I worked together with someone who was self-employed, a very small company, just him and his wife. I learned a lot of what I know today about software engineering from him, and he taught me that explaining the issue often solves the issue, just because you start to connect dots in your head that you wouldn't have otherwise. And he taught me: just pick a rubber duck.

Speaker 1:

And he gave me one: put it next to your desk, and when you don't know something, talk to it. And it does help, but it's super awkward, of course.

Speaker 2:

Especially if someone else sees you; then you need to explain it, right?

Speaker 3:

Yeah, or maybe you can just have AirPods in nowadays and pretend that you're talking on the phone. That's what I do when I'm talking to myself on the street and people start looking at me funny; I'm like, oh yeah, okay, thank you, and I just take them off and stop. That's how you play it cool. So, tip for you. But yeah, actually, I never heard it as explaining line by line what the code does.

Speaker 3:

No, but the problem that you're facing. Yeah, that's it. I mean, actually, I think most of us have probably experienced that, right? You went to ask someone a question, and then halfway through the question: oh, actually, never mind.

Speaker 3:

Yeah, that's true, that's the idea with just talking to a duck. The way I reason about it is: for you to be able to formulate what your problem is, you have to think clearly about it, right? Pretty cool, pretty cool. And we were talking about Fabric. You got the MVP for Microsoft Fabric.

Speaker 2:

You have a question, right? I have a question, but I was asked not to ask the question, so I'm going to rephrase it. I wanted to ask: what is Fabric? But apparently that's a very broad question, so I'm going to try to rephrase it a little bit, because I am interested in it, in the sense that Fabric is yet something else in the data ecosystem, and I think it's interesting for people to understand: what can they compare it to? What does it replace? What does it not replace? Is it a drop-in replacement for Snowflake, or is it not? Is it something for Databricks? Is it something for Tableau, maybe even? Where does it start, where does it stop? What is the positioning of Microsoft Fabric?

Speaker 1:

So what you should look at is that before Fabric, if you asked someone to design a data platform, what you ended up with was this huge diagram with all these fancy logos and arrows all over the place, and that's usually complex for people who just want to basically query tables, see what's inside and make reports out of it. Well, that's a bit minimalistic, but that's what you want to do at the end of the day.

Speaker 2:

In essence, that's what it is.

Speaker 1:

Well, on the Microsoft stack you also had this. If you wanted to work with data, you needed Azure Data Factory to ingest data into your data lake, which is by itself not a separate service but Azure Storage accounts. Then, if you wanted to run Spark, you could go for Spark in Azure Synapse, and so on. There are lots of different things that you have to connect. With Fabric, it's just one thing that has everything, and I think the only two competitors who similarly offer one service that does everything are Snowflake and Databricks. Databricks more or less, because it's also a cloud service across the three clouds, but if you log into Databricks, there is one platform where you can usually find everything that you need. That's a bit how you should see it. Of course, it's called the platform for the era of AI.

Speaker 3:

Why the era of AI? What makes it the platform for AI? Is it just because AI is super sexy right now?

Speaker 1:

That is part of it, I think. But if you want to properly work with AI, training machine learning models and so on, you need scalable infrastructure, and you're not going to do that on technology from ten years ago. You need something that can easily ingest data, work with data, and scale the compute capacity so that you can train your AI models or work with AI in there. And AI has many different aspects: Fabric is a platform that you can use to build your own AI solutions, and it also offers lots and lots of AI features in it. Microsoft, the company known for Copilot, will obviously also see where they can put Copilot within Fabric, and they already have Copilot available in certain features of Fabric today. So if you want to get the most out of AI, both as a consumer and as a developer, then Fabric is a platform to go to.

Speaker 3:

Cool. So I see here they have the SQL, I guess, for doing transformations and stuff.

Speaker 1:

This is in notebooks.

Speaker 3:

This is in notebooks for the AI part, I guess. So also the reports.

Speaker 2:

So it's really an A-to-Z kind of solution. There's a way to ingest data, so it has the functionality of something like Data Factory or Airflow. You have storage, you have a query engine on top of that storage, and you also have more ad-hoc ways to explore, like notebooks, these types of things, to interact with your data.

Speaker 1:

Yeah, yeah. Spark functionality, everything that you can do with Spark, is in there. Real-time is also a big thing in Fabric.

Speaker 2:

And if we talk, for example, about data ingestion or the other components: is this all a mix of things that were already there, repackaged, or is this really something that they built from scratch?

Speaker 1:

It's a bit of both. I had lots of discussions on LinkedIn and in the community as well, with some people calling Fabric just a rebranding of things we already had, and I firmly disagree with that. It's not a rebranding, even though they reused lots of branding there. I'm a data engineer, so I look more at the data warehousing capabilities and so on, and the whole warehouse engine that you have in Fabric is called Polaris.

Speaker 1:

That project is something they built from the ground up. It was already available in some way in Synapse, as the Synapse serverless engine, but everything that you could do with serverless was quite limited. In Fabric, you can basically use it as a fully functional data warehouse. So that's one engine that's really built from the ground up specifically for this product. Then the Data Factory in Fabric: they also call it Data Factory, and I saw people calling it Azure Data Factory, but it's not Azure Data Factory; it's a simpler version to use. It works in a bit of a different way and offers more flexibility. In Azure Data Factory there was a concept of datasets and so on; that's not available in Fabric, and it was also something I usually didn't find that useful. For a lot of features, they really thought about it: if we would rebuild this from scratch now, how would we build it? And that's the version that you have.

Speaker 2:

Okay. For these different components that are part of Fabric, how is the integration between them, the interactivity between them? In the sense that, in my eyes, Microsoft was not always great at this: there's a new service, but integrating it into another service takes a lot of time. How is that with Fabric?

Speaker 1:

That's quite easy. Fabric is a lakehouse-based design. This means that you have storage and compute separately, and everything in Fabric is stored in OneLake. It's called OneLake because there's only one, which might seem obvious.

Speaker 2:

Everything that you have in terms of data will in some way end up in OneLake. And OneLake is called OneLake? Not just that there is one lake? No, the name is actually OneLake, like OneDrive or something.

Speaker 1:

Within OneLake, your data is there, and you can consume it through the different engines. You have the Spark engine that can read and write data from and to OneLake. You have the data warehouse that can read and write data to and from OneLake. Because OneLake is a shared layer between all of them, it's quite easy to use them all together in whatever way you desire.

Speaker 2:

Does that mean that data is stored in files, like Parquet files? What does the storage look like?

Speaker 1:

They chose Delta Lake, the format that Databricks originally developed. It's now open source.
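For listeners curious what Delta Lake adds on top of plain Parquet: the heart of it is an ordered transaction log sitting next to the data files. The sketch below is a toy illustration of that commit-log idea in pure Python, not the real Delta implementation, and the file names are invented.

```python
import json
import os
import tempfile

# Toy sketch of Delta Lake's core idea (not the real implementation):
# a table is a folder of immutable data files plus an ordered JSON
# commit log (_delta_log) recording which files belong to the table.
table_dir = tempfile.mkdtemp()
log_dir = os.path.join(table_dir, "_delta_log")
os.makedirs(log_dir)

def commit(version: int, actions: list) -> None:
    """Write one commit file; versions are zero-padded so they sort in order."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")

# Commit 0 adds two data files; commit 1 replaces one of them.
commit(0, [{"add": {"path": "part-0000.parquet"}},
           {"add": {"path": "part-0001.parquet"}}])
commit(1, [{"remove": {"path": "part-0001.parquet"}},
           {"add": {"path": "part-0002.parquet"}}])

def current_files() -> set:
    """Replay the log in order to find the live set of data files."""
    live = set()
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"]["path"])
                elif "remove" in action:
                    live.discard(action["remove"]["path"])
    return live

print(sorted(current_files()))  # part-0000 and part-0002 remain
```

Because readers replay the log to decide which files are live, a writer can add and remove files atomically by publishing a single new commit file; that is what gives Delta tables transactional behavior on top of plain object storage.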

Speaker 3:

That's the format that's also being used within OneLake. And maybe, since you mentioned Databricks: is there any link between Azure and Databricks as well? I know that, for example, with MLflow, they actually use the MLflow that comes from Databricks, and I think it's very easy to start a Databricks workspace from an Azure account.

Speaker 1:

Databricks is a first-party service on Azure. That means that it's deployed like any other native Azure service: deploying a Databricks workspace has the same complexity as deploying an Azure storage account. It's really first-party. I think Microsoft invested some money into Databricks, but they don't own Databricks; Databricks is independent, and they offer their product on GCP and AWS as well. But they share lots of technologies: there's Delta Lake for the storage, there's Spark, started by the founders of Databricks and now open source in the Apache Foundation, and there's MLflow as well.

Speaker 3:

The promise, then, with Fabric is you go there and that's all you do. You don't need to do anything else.

Speaker 1:

Exactly, and it's also super approachable. All the previous data features that you had in the Microsoft stack were on Azure. Basically, you had Power BI on one hand, but then Azure Data Factory, Azure Synapse, Azure SQL if you wanted to do something there, and those features were therefore a bit further away. Someone who works daily in Excel, or a bit more data-focused in Power BI, usually doesn't go into the Azure portal and start opening services in there. That was more something for data engineers, cloud engineers and those kinds of profiles. Now, because Fabric is part of the other kind of cloud that Microsoft has, more part of the Power Platform next to Power BI, it's very approachable to another audience.

Speaker 3:

But if you're on Fabric, you mentioned Power BI: now you don't need to go to Power BI, because you can create dashboards in Fabric as well?

Speaker 1:

They actually integrated Power BI into Fabric, so Power BI is one of the big parts of Fabric. If you now go to app.powerbi.com, it will just open Fabric for you.

Speaker 3:

Really.

Speaker 1:

Power BI is one of the main features of Fabric. Everything with dashboarding and reporting has not gone away; they added lots of new features there as well, and it's nicely integrated. There were also struggles if you used Power BI with your data, on how to connect and so on; they fixed that with Direct Lake. So there are really cool things in there to help the Power BI users as well.

Speaker 3:

Is it really expensive? What's the pricing?

Speaker 1:

No, I think it's A bit of a.

Speaker 2:

Is it really expensive?

Speaker 3:

I think it's more like, I don't know, it's not that a platform is expensive or not, but that you only pay for what you use. SageMaker, for example, they advertise that you pay just for what you use, but it's just the services underneath. Is Fabric something like that, or is it completely different?

Speaker 1:

Fabric is really built for the cloud. It's also a SaaS product, so Microsoft can optimize the costs much more than they can with other services, and that's something you notice as a user as well. It's pay-as-you-go: you can create a Fabric SKU on Microsoft Azure, and the lowest one starts, if you keep it running for the entire month, at around 250 euros or dollars a month, something like that, and that one can already deliver some great performance. Some data products need to go a bit higher than the entry level, and budgets can also go to thousands a month.

Speaker 3:

Depending on your scale. But I think Fabric actually made quite a lot of noise among the recent Azure launches; I think Fabric is much bigger than Synapse was, from what I heard.

Speaker 1:

Satya Nadella calls it the biggest innovation in data since the original release of SQL Server, and SQL Server is one of the biggest database engines worldwide, so that says something.

Speaker 2:

Maybe to come back a little bit to the storage, where you were explaining Delta Lake, which is sort of a version of Parquet, it's built on Parquet, but it's typically for structured data. I guess you can also use OneLake for unstructured data? Let's take something random here: audio files that you want to process. Can you drop in anything?

Speaker 1:

Yeah, indeed. OneLake is still a data lake in the end. Can I see it basically as blob storage? It is actually blob storage underneath.

Speaker 3:

And maybe explain blob storage, for people that have never heard of it before?

Speaker 2:

I think an S3 bucket would be the thing that more people have heard of. It's like online file storage, like a folder: you just put stuff there.

Speaker 1:

Yeah, Dropbox, OneDrive, that's a bit how you should see it as well. It is built on Azure Data Lake Storage Gen2, which is just Azure Storage accounts with the hierarchical namespace enabled, which enables the folder hierarchy in it. Now it's offered in another way, so underneath it's still that same storage service Azure already had, but with OneLake it's easier to access.
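To illustrate what the hierarchical namespace buys you over a flat blob store, here is a toy model in plain Python, not the Azure Storage API: in a flat namespace, "folders" are only key prefixes on string keys, so renaming one touches every blob, while a hierarchical namespace makes the directory a real object.

```python
# Toy contrast between a flat blob namespace and a hierarchical one
# (illustrative only; this is not the Azure Storage API).

# Flat namespace: "folders" are just key prefixes on string keys.
flat_store = {
    "raw/2024/a.parquet": b"...",
    "raw/2024/b.parquet": b"...",
}

def rename_prefix_flat(store: dict, old: str, new: str) -> int:
    """Renaming a 'folder' means rewriting every blob key, one by one."""
    ops = 0
    for key in list(store):
        if key.startswith(old + "/"):
            store[new + key[len(old):]] = store.pop(key)
            ops += 1
    return ops

# Two blobs -> two operations; with millions of blobs this gets expensive.
print(rename_prefix_flat(flat_store, "raw", "bronze"))

# Hierarchical namespace: directories are real objects, so renaming
# the directory node is a single metadata operation.
hier_store = {"raw": {"2024": {"a.parquet": b"...", "b.parquet": b"..."}}}

def rename_dir_hier(store: dict, old: str, new: str) -> int:
    store[new] = store.pop(old)  # one pointer move, regardless of contents
    return 1

print(rename_dir_hier(hier_store, "raw", "bronze"))
```

The same distinction is why analytics workloads, which constantly list and move directory trees, benefit from the hierarchical namespace that Data Lake Storage Gen2 adds on top of plain blob storage.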

Speaker 2:

You also mentioned this concept of decoupling storage and compute. Maybe unpack that a little bit for people that are unfamiliar with it: why is this a good idea? And maybe start from a more typical example, something like MySQL, which doesn't have this, right? Why is this a good idea

Speaker 1:

when you talk about a data platform? Well, on a typical database like MySQL, you have one machine that does all the calculations.

Speaker 1:

If you want to do an intensive query to join a few tables together to do some aggregations, that same machine also holds the storage.

Speaker 1:

Classically, that means if you want to store more data, it's that same machine where you would have to increase the storage. It also limits flexibility, because all the data that you want to do any calculations on already has to be on that SQL database machine. If you decouple them, something we first started doing a lot with Spark, you decouple where the compute happens from where the storage happens. Then you can use very cheap cloud storage solutions like S3 or Azure Data Lake Storage Gen2, where you can have terabytes for only a few euros a month, and combine that with machines that are really focused on compute capacity. That means you can scale both of them more easily and don't have to worry about the size of your data that much anymore. Duplication of data is never a good thing because of the governance aspect of it, but you should not really care anymore if you store a few gigabytes of data more or less.

Speaker 1:

It's the processing that's really going to count for your cost and efficiency.
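A back-of-envelope calculation makes Sam's point concrete. All prices below are invented round numbers for illustration, not real cloud pricing.

```python
# Back-of-envelope sketch of why decoupled storage and compute is cheap.
# All prices are made-up round numbers, not actual Azure or AWS pricing.

storage_price_per_tb_month = 20.0   # cheap object storage (hypothetical)
compute_price_per_hour = 4.0        # a beefy compute cluster (hypothetical)

data_tb = 5                         # 5 TB of data at rest
compute_hours_per_month = 40        # cluster only runs while queries run

storage_cost = data_tb * storage_price_per_tb_month
compute_cost = compute_hours_per_month * compute_price_per_hour
print(storage_cost, compute_cost)   # storage: 100.0, compute: 160.0

# Doubling the data at rest only doubles the small storage bill; the
# compute bill is unchanged, because the cluster is scaled independently
# and only paid for while it runs.
print(2 * data_tb * storage_price_per_tb_month)  # 200.0
```

The asymmetry is the whole argument: data at rest grows the small bill, while the expensive resource, compute, is paid for only while it runs and can be sized independently of how much data you keep.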

Speaker 3:

Yeah, usually storage is cheap, right? The thing that you usually pay the most for, the thing that drives the bill, is actually the compute, and you only pay for what you're actually using. And as you were saying this: I mentioned Snowflake, you mentioned Databricks. I think I read on the Snowflake blog or something, it's not public yet, but they are planning to offer notebooks as well. They also have a scheduling thing, like a DAG thingy; it's a bit primitive, you declare it with SQL, and I don't even know if there is a UI where you can see the graph, but they do have something like this. Are they all going in the same direction?

Speaker 1:

Well, it's a bit lucky for us as users. They probably always try to compete with each other as much as possible. If they see that one of them has some really cool new feature that gains a lot of traction, then they will probably also do the same thing in the other data platforms. It's interesting to see where they are going. For me, I find it a bit strange that a company the size of AWS, for example, doesn't have such a big platform that does everything, and that we have to look at Databricks, Snowflake or Fabric for solutions. Not every company needs something of that size, of course, as well. It depends a bit, of course, like everything in IT.

Speaker 3:

What is?

Speaker 1:

the right choice for you.

Speaker 3:

And maybe, if I have all the options there, why would someone go for Fabric and not Snowflake or Databricks? What are the trade-offs there?

Speaker 1:

So I think the pricing is interesting, it starts really low. They also have a pricing model, I can talk more about that, where you really see that it's a SaaS product and that Microsoft is the only one who can do this, since they own both the cloud, the actual data centers, as well as the product, so they have the capability to scale everything efficiently and to make sure that all resources are used as cost-optimally as possible. So, in terms of pricing, if you compare them, Fabric will usually be one of the more interesting ones. And accessibility: it's a brand-new platform designed for today's business needs. A few episodes ago, we talked about the analytics engineer.

Speaker 1:

That's also the main persona that Microsoft aims at with Fabric. It becomes more accessible, right? Before, you needed a team: a cloud engineer, a data engineer, often a DevOps engineer as well, maybe it shouldn't be a separate job, and the BI person who does the dashboarding, and then maybe some analysts to understand what's happening. With Fabric, the technology has become so approachable, with lots of low-code solutions as well, that everyone can start to use it.

Speaker 3:

Like low-code solutions, you mean like creating the DAGs with drag and drop and stuff?

Speaker 1:

There are lots of options in there for low code. You can drag and drop in the Data Factory tooling to just say: first this task, then this task, then this transformation, and connect them together with arrows, and so on. The Power Query feature from Power BI was quite popular amongst the users of Power BI. That's also available within Fabric, so that you can transform your data with it. Dataflow Gen 2 is the name, I think.

Speaker 3:

What does that do?

Speaker 1:

It's a view, like in Power BI, where you have a table and then a ribbon on top, like in every Office tool, and you can say Pivot or Transform, or maybe, if you have a CSV, say that your header row is the first one, these kinds of things.

Speaker 3:

So you manipulate the data by interacting with the table itself instead of having to write code, right, in the UI.

Speaker 1:

Yeah, and even then, for the people who prefer to code, it can generate the code for you in the notebooks. That's also quite easy. They also have a Data Wrangler, where you also get thrown into a table interface, do your transformations, and it spits out the Spark code for you, documented and all.

Speaker 3:

Really, really cool.

Speaker 1:

So you have a mix of everything. If people want to code, they can code. If people want to use the UI, they can do so as well.

Speaker 3:

And so you mentioned Spark. So you still run Spark on Fabric.

Speaker 1:

You can. So Fabric has several different engines running on top of this OneLake, one of them being Spark. One of them is the SQL-based engine, whose project name was Polaris, where you interact with your data in SQL, so you can also use dbt to transform your data. You also have the streaming part of Fabric, the KQL (Kusto Query Language) database. So it depends a bit on what you prefer as a user. If you're a team that wants to do everything in SQL, then you can do SQL. If your team has Python as a preference, then that's also possible.

Speaker 2:

And maybe to jump on Polaris. Polaris is a query engine that they built from scratch right, which is still relatively young. Are there limitations to Polaris today? Are there reasons not to go for Polaris Because it is their native query engine?

Speaker 1:

No. So I was at a conference in Denmark, the first edition of a data platform conference that was all about Fabric, and one of the main speakers, I don't know his official title, Bogdan Crivat, he leads the engineering side for Polaris.

Speaker 1:

Before that, I think, he did something with Synapse SQL and so on; it's his thing. He explained that the engine is ready to use. What they need to do now is expose all of these features to the users. So the engine is there, but maybe the SQL syntax to talk to it is not there yet, and then they need to add it. There are some limitations on the SQL language that are still there, some parts of the T-SQL dialect are not supported yet. But it's just a matter of prioritization and backlog and so on for a lot of them, to see when they would arrive, and that's work that is going to happen in the coming years, probably, depending on the need, of course. Then, in terms of performance, they keep on improving working with larger data sets. As far as I recall, they can comfortably work with a few petabytes now, but that's not enough for every company.

Speaker 1:

So they keep on increasing that, and if they improve at the high end of the scale, you obviously also notice the benefits at the lower scales.

Speaker 2:

Do you know what it is implemented in, which language?

Speaker 3:

Rust, I don't know.

Speaker 1:

I would guess C++.

Speaker 2:

I have something embarrassing to admit about Polaris.

Speaker 3:

Hold on, everybody stops what they're doing.

Speaker 2:

We heard about Polaris almost a year ago now, maybe even longer, time flies. And actually, when I heard it in passing, like, they're building this, I remember this, I thought it was about Polars, the query engine built in Rust, and I thought: cool, cool that they're using Polars.

Speaker 3:

So what you heard is that they're using it. You didn't hear like they were building it.

Speaker 2:

Well, I heard this announcement in passing, and I was assuming Polars is going to be the query engine on Microsoft Fabric. And it's only like, don't record this, I think it's only like six months ago that I realized: oh shit, this is something else, this is not Polars, this is Polaris. So yeah, there's that, okay.

Speaker 3:

It's okay, it's not like we don't make mistakes. Just remember the grace that I'm extending to you, because later...

Speaker 2:

I see, there You're a bit anxious for the hot take.

Speaker 3:

I don't know you got me on my edge.

Speaker 1:

You know, I went into the history of it a bit, to understand where they started, and they published a research paper, I think in 2020, at VLDB.

Speaker 2:

It's already that long ago.

Speaker 1:

Yeah, and this is like the conference where all the cool kids with database engines come to show their database.

Speaker 2:

Is this it yeah.

Speaker 1:

And so there's the research paper.

Speaker 1:

That's the one, and you see the first name there, Josep Aguilar-Saborit. He is the main author of this paper. So what I did: you have all these names there; go to their LinkedIn and see when they started working on a new project, that they call a research project or something, within Microsoft. I never got the official confirmation or anything, but from what I see from lots of these people's history on LinkedIn, it must be somewhere in 2018 that they started the project. Interesting. And is this the conference, or is it the research group? Very Large Data Bases.

Speaker 2:

From what moment onward are you allowed into this group?

Speaker 1:

I have no idea. It needs to be very large, not just large. I think this is where the Snowflakes and BigQuerys and so on are presented when they create their new stuff.

Speaker 3:

I was actually looking for the paper, Bart.

Speaker 1:

I think it's on my blog in the post about it.

Speaker 2:

Everybody go to Sam's blog. It's on the screen: debruyn.dev.

Speaker 1:

You go to the blog section there, and then, which blog post was this? I think there is a Microsoft Fabric rebranding thing, so the second one from the top, if you look there for Polaris.

Speaker 2:

We'll link it in the show notes.

Speaker 3:

And there is the paper here. There we go, this is the paper, I think. And do you still monitor these people? And then, when they're like, oh, we're working on a research project: something's up.

Speaker 2:

Shit. They started off something.

Speaker 1:

I'm going to buy more stocks from Microsoft. There's something interesting in the paper that Microsoft also never got any word out about. It mentions Fido. If you Ctrl-F, you will find it, and there's also a paper about this Fido, you see the two there, it's referenced at the end, and that paper was never made public. It's Microsoft-internal.

Speaker 1:

But this must be some kind of storage project, a data lake kind of thing, that they originally wanted to combine with Polaris, and it never got published, it never saw the light of day in any product. OneLake is actually built on Data Lake Storage Gen2. My bet is they have concepts from Fido that are now implemented in there to make OneLake work. Interesting. Yeah, you see, there is something else on Fido on that screen.

Speaker 3:

Interesting. You really went deep on this.

Speaker 2:

Maybe it's something to use internally for their own process.

Speaker 1:

Yeah, it might be, very interesting, cool, cool, cool.

Speaker 3:

Maybe, with all this, it's okay to get the ball rolling. Get the ball rolling, Marilo.

Speaker 2:

So we talked about Fabric.

Speaker 3:

I asked how it compares with Snowflake and things like that. And sometimes, especially when we talk about databases, so, more broadly, I know Fabric is more than just that, right, a lot of times you talk about performance. And actually you shared this post a long time ago. This is from MotherDuck, so this is one of the DuckDB companies that Bart got inspiration from before.

Speaker 2:

I don't know, maybe I'm the OG right, I've been using ducks for a long time.

Speaker 3:

Sure you have, I've seen. But it's like, I never questioned it, but it never made sense either. You know, like, you put a Slack status and it's a duck one.

Speaker 2:

The best things in life don't make sense, okay.

Speaker 3:

Can we?

Speaker 2:

call that Don't question it.

Speaker 3:

Yes, sir. But yeah, anyways, MotherDuck. So they actually have here, like, "Perf is not enough". You shared this originally, it caught my attention as well, and it talks about the cult of performance in databases. It actually gives a bit of a benchmark, like, there are these famous TPC-H and TPC-DS, right?

Speaker 3:

I also think, for me sometimes, like, I look at these, I spend a lot of time, not only on databases but also on Polars. You know, when you talk about Polars, you also see the benchmarks against pandas and the TPC and this and that, and I spend quite a lot of time trying to see what's the best one. I also have this: I'm going to start a new project, what should I use? And then you kind of go in the rabbit hole: oh, this is a bit faster here. And the conclusion that I have a lot of the time with these databases is like: yeah, Polars is the best when you have this much data and you have to do two group-bys and you have two where clauses and you have this and this and this. You know, basically when you run TPC-H.

Speaker 3:

Exactly, right. And I even did a talk last year or two years ago, last year. It wasn't about databases, but it was about timing execution time for Python scripts, and I actually made a little disclaimer in the presentation saying that all the benchmarks are wrong, not just mine, the ones that I put there, but in general, right? Because unless you're doing that exact same query, with the exact same amount of data, with this exact that, it's not going to be, you know. But at the same time...
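A concrete version of that advice, timing your own representative query on your own data instead of trusting a published number, can be sketched with Python's standard-library `timeit`; the workload below is entirely made up for illustration:

```python
import timeit

# A stand-in for "your exact query on your exact data":
# here, a group-by-and-sum over a list of (key, value) rows.
rows = [(i % 10, i) for i in range(100_000)]

def group_by_sum(rows):
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals

# Repeat the measurement several times and keep the best run,
# which reduces noise from other processes on the machine.
best = min(timeit.repeat(lambda: group_by_sum(rows), number=10, repeat=5))
print(f"best of 5 runs (10 calls each): {best:.4f}s")
```

The point, matching the discussion above, is that the only benchmark that transfers to your project is one run on your data, your query shape, and your hardware.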

Speaker 3:

So I said that at the conference, but then I was thinking about it. At the same time, I do think that they're useful. They're all wrong, but they're useful. Because, for example, pandas was never in the top five of the benchmarks, right? Which says something, right? I feel like... What does it say?

Speaker 2:

It says that probably there are better alternatives in terms of performance, in terms of speed.

Speaker 3:

If you talk about, like, I think, people that use pandas today, they're using it more because of the API, because it's very popular, there's a lot of documentation, a lot of history, right? So that makes sense. But when you talk about performance, and I think it would be the same for databases, if you have one database that is always in the top three, even though all the benchmarks are wrong, according to me, I still feel like there is value in looking at them as a whole.

Speaker 1:

Well, yeah, yes and no. So, for example, DuckDB is a database that has gained lots of traction, but if you look at the benchmarks that Polars has published, DuckDB is not in the top 10 there, as far as I remember. Maybe, don't quote me on this, you'd have to look it up. But I think what this blog post says is that all these benchmarks are synthetic and, like you say, they are only efficient in this benchmark, but it doesn't show what else they have.

Speaker 1:

If you have a system and it can process your queries all within one millisecond, but it takes you 10 days to boot up the system, and then you have to wait again for 10 days to run your next set of queries, then how efficient is that one millisecond?

Speaker 1:

Or if you have to type your SQL queries and, let's say, you have to put a dot between every character: do you go for the fastest one, or for the one that you can actually use? That's, I think, what this post is trying to say as well: your performance depends on what you need. If you have lots of queries with joins and group-bys, then you have to evaluate against that. If other companies have queries that only do a select 1 or something, then benchmark against that, specifically for you. And even if you do it on your laptop versus on my laptop, it will already be different. And typically what is published is very biased, because you are trying to create a story, especially if you're a company that just made a database.

Speaker 2:

You don't publish if you're the second one.

Speaker 3:

Yeah, that's why, even, we talked about it, and there is some, I mean, there's some truth in this: you need to look at it in a holistic way.

Speaker 2:

I think that is a bit of the story that they're trying to make.

Speaker 1:

What is the business impact of a fast-running query if the impact on that same developer is that they have to spend more time writing that query?

Speaker 3:

Yeah, I also saw, like, I thought it was interesting, on the takeaways here, what was it? "Beware of the database vendor that cares most of all about performance; that will slow them down in the long run." And I think, I mean, it's not exactly what you're saying, but I guess it's like...

Speaker 2:

Isn't this DuckDB? Isn't DuckDB very much on the performance side?

Speaker 1:

No, DuckDB wants to be accessible, well, if accessible is the word. Is that their focus? Well, DuckDB is something you can easily use.

Speaker 2:

You don't have to deploy a database or something.

Speaker 1:

Ah, I see what you're saying?

Speaker 2:

How do they position themselves? Because I hear them very much on the very nitty-gritty details of how to optimize this performance or that performance.

Speaker 1:

But then they also say, in each version release, that they are going to improve this or that in the next version or something. Interesting. They do like to tinker with it, so it can be that performance is a focus.

Speaker 2:

Yeah, maybe it's also a bit passive-aggressive of them, because they will probably not be allowed into the Very Large Databases conference.

Speaker 3:

They're just on the large, or just, like, big, not even large. Yeah, actually, I was looking for this db-benchmark, but I cannot find it.

Speaker 1:

The one from Polars is on their GitHub, the first thing. On the Polars GitHub, yeah, they publish it because they're very proud of it. Yeah, yeah, yeah. See, and that's written in Rust, so unbeatable.

Speaker 3:

Unbeatable indeed.

Speaker 2:

So, you use this frequently? Like, you were a pandas user, and are you now a Polars user?

Speaker 3:

Did you switch, I mean? So, if I have to use something today myself, if I start a project, I'll probably use Polars.

Speaker 2:

Okay.

Speaker 3:

Polars, or DuckDB, if it makes sense, if it's easier for some reason. But I think, like, today, I mean, there's also the zero-copy between DuckDB and Polars, right, because it's all Arrow, so ideally you don't pay for copying data. Okay, so choosing one or the other doesn't matter too much. But most of the time now, today, when I need to use pandas, it's because I'm on a team and the decision was kind of made. You know, if I'm on a team with other data scientists, everyone's using either pandas or sometimes Spark, because, I don't know. So then I'll just stick to that, okay.

Speaker 1:

No, just wondering: pandas 2.0 is also out now. Pandas 2, you should go for it.

Speaker 3:

I used it in the very beginning, but it was a bit finicky still, like, I think the column types were a bit off; even the rendering was a bit different, because it's a different type in the back. But yeah, I'm sure that it's better now. The thing I don't like about pandas, to be honest, is that there are so many ways to do the same thing. You know, like, you want to create a column: you can do dataframe, brackets, column name, equals value.

Speaker 1:

Right, that's fine, and you can also do iterrows.

Speaker 3:

But you can also do, like, df.assign, with the chaining thing, which is also the way I prefer, right? So I feel like there are so many ways to do it, and I do think that even Spark or Polars, because they have one way and it's a bit opinionated, they kind of encourage people to do things the same way. Like, you can have pandas pipelines that are just .method().method().method() a whole bunch of times, and then you have the transformation at the end. You can do that, but that's not what most people do, I feel. And I think that's the only thing: when I look at it, it's harder for me to read and understand what the pandas code is doing.

Speaker 1:

You know, you don't have this in SQL.

Speaker 3:

That's true. You don't have it in SQL. I like SQL too. I like it.

Speaker 1:

I don't transform my data in Python anymore.

Speaker 3:

Never.

Speaker 1:

With all the stupid brackets it's too much. Yeah.

Speaker 2:

Do you prefer Python over SQL?

Speaker 3:

No. For doing data transformations? Yeah, no, not really. Actually, I think there are a lot of ifs, ifs, ifs, right, a lot of, how do you say, qualifiers. But I think most of the time, if you're manipulating data in pandas, the data is in a database already or something like that, and you can query it, right? So then I think, if you're reading that, just use SQL: query the data, do as much as you can there, and then go for Python if you really need it. But I think most of the time you don't need it; you can do quite a lot in SQL. I agree, and I also think, yeah, indeed, it's much easier to do something very complicated in Python, and I think it's harder to do that in SQL. It's still possible to do something that you cannot understand when you read it afterwards, though.

Speaker 1:

At the same time, SQL forces you to make it readable.

Speaker 3:

Yeah, indeed. I mean, there are still exceptions, I feel: I can still see some SQL like, man, what is this? You know. But in general, I do think SQL forces you to do something that is more readable, more organized. It's less flexible, but, you know, especially if you're on a large team.
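One small illustration of SQL nudging you toward readable structure: a CTE gives an intermediate step a name instead of burying it in a nested subquery. This sketch uses Python's built-in sqlite3; the table and names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('ann', 10), ('ann', 5), ('bob', 7);
""")

# The CTE names the aggregation step, so the final SELECT reads
# top-down instead of inside-out through nested subqueries.
query = """
WITH totals AS (
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
)
SELECT customer, total
FROM totals
WHERE total > 8
ORDER BY customer
"""
print(conn.execute(query).fetchall())  # [('ann', 15.0)]
```

The same query written as `SELECT ... FROM (SELECT ... GROUP BY ...)` returns the same rows; the CTE form is what keeps ten-step transformations digestible.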

Speaker 2:

I think I can still do spaghetti SQL code Ha ha, ha ha.

Speaker 3:

Bart says hashtag proud. No, I've seen, I mean, I can probably do it too. But I still think, like, if I'm working on a team and I'm not super 100% sure about the people there, I think, if I say everyone does SQL, I rest more assured at night. I sleep better knowing that when I wake up the next day, I'll be able to understand any bug in all these things.

Speaker 1:

It's a bit what dbt brought to the SQL world. And still, yeah, you can have these SQL statements with 10 subqueries and so on; that still requires quite some effort to digest.

Speaker 3:

But yeah, I think dbt is actually very interesting too. Because it's still SQL, but I feel like before, SQL was more of a "you just do a query", and dbt came with a structure, you know: you can almost build pipelines, you have some standards, you can version these things. So I quite like it. Maybe a quick link on dbt: one thing I saw is that dbt 1.8 is coming out, yeah, and it's actually coming with unit testing. Indeed.

Speaker 2:

How does dbt unit testing differ from dbt tests?

Speaker 1:

dbt tests test the data. dbt unit testing tests your code.

Speaker 3:

That's very elegant. I was expecting a hot take from Sam there.

Speaker 1:

It's like my job. If you create macros, you want to test them before you use them somewhere on the data that you need, and this is what dbt unit testing should solve. There were lots of community projects to do so, and now it's going to become part of dbt itself.

Speaker 2:

Let's say you have a macro, I mean, you still need, like, a fragment of test data then, somewhere, right? Typically with dbt, you run against an actual database, and you can test whether or not it applies successfully, and you can pseudo-test that way. But you need to do it against your actual database, right? And is it with unit tests, then, that you have, like, a fragment of data that you define, or do you define asserts that you expect, or? I haven't looked into the final spec yet that would be coming out.

Speaker 1:

But this is the page that has it, and, well, they have found a YAML way, of course, to define it.

Speaker 2:

Here they show it, actually. If you zoom in a little bit, Marilo is showing the spec here on the screen: for a given input, a number of rows, and you can also define the expected output for those rows.
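For readers of the show notes: the proposed dbt 1.8 unit-test spec is YAML along these lines. The model and column names below are invented for illustration, and the exact schema may differ in the final release:

```yaml
unit_tests:
  - name: test_totals_are_summed
    model: customer_totals        # hypothetical model under test
    given:
      - input: ref('stg_orders')  # hypothetical upstream model
        rows:
          - {customer: ann, amount: 10}
          - {customer: ann, amount: 5}
    expect:
      rows:
        - {customer: ann, total: 15}
```

The idea is exactly what is described above: you pin a small fragment of input rows and assert the rows the model should produce, without needing production data.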

Speaker 1:

But it would be interesting to see: do they run this in your database, or does it stay within dbt Core?

Speaker 2:

Well, I hope it stays in dbt Core, because then you can really isolate it. Yeah, it becomes much easier to test, also, like, setting this up locally.

Speaker 3:

But then like but there are differences on the SQL dialect, right, If you do it locally, how would you?

Speaker 2:

Yeah, that's a fair point.

Speaker 3:

That's a good question.

Speaker 1:

Yeah, but that doesn't matter that much, since you're not testing the execution of your SQL but the execution of your Jinja, you see? So, dbt has a templating language: Jinja is used in dbt to make your SQL more flexible. If you wanted to do the same operation on 10 columns, you would have to paste it 10 times in plain SQL; that's what Jinja solves for you. The goal of the unit testing feature is not to test your SQL code, but to test what you have created within dbt, within Jinja and so on.
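The "same operation on 10 columns" case can be sketched as a dbt-style Jinja macro; the macro, column, and model names here are hypothetical:

```sql
-- Hypothetical dbt macro: apply one operation to many columns,
-- instead of pasting the expression once per column.
{% macro rounded(columns) %}
    {% for col in columns %}
        round({{ col }}, 2) as {{ col }}{{ "," if not loop.last }}
    {% endfor %}
{% endmacro %}

select
    {{ rounded(["revenue", "cost", "margin"]) }}
from {{ ref("my_source_model") }}
```

A unit test for this would check that the template expands to the intended SQL, which is the "tests your code, not your data" distinction made above.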

Speaker 2:

So, if I interpret that correctly: it tests the templating, the execution of the Jinja templating, and not the logic of the template?

Speaker 1:

So, as I understood it, yeah, that's the case, if you have macros and so on. But here I also see that there are some data tests with CSVs.

Speaker 2:

I have to better look into it. We're gonna do a deep dive and come back to this.

Speaker 3:

I think this is still coming out, right? But I think it's cool. That's actually something that I was thinking through when I was setting up dbt on a project: oh, how can we make sure these things are good? So, really cool, excited to see what comes out of this.

Speaker 1:

Usually, I learn about these things when I have to make changes in the adapters, but this time, apparently, there are no changes needed in the adapters; they can do it with everything that they already had. So I was like: I want to learn more about it, but now I have to dive into the code.

Speaker 3:

Yeah, yeah, cool, cool. Do we have a bit more time to, you know, talk about Reddit, Bart?

Speaker 2:

Oh, Reddit. Think we don't have enough time?

Speaker 3:

We don't have enough time Reddit.

Speaker 2:

They IPO'd. Well, crazy valuation, I think six billion, something like that, that they opened at, and you're putting it on the screen: they IPO'd at $34 per share, but I think they ended the day at something like $50.

Speaker 3:

Really.

Speaker 2:

But let's see where we are a month from now, right?

Speaker 1:

Right, the CEO is one of the best paid CEOs in the big tech world.

Speaker 2:

There is a huge amount of discussion on that, because moderators are volunteers and moderators have not been treated very kindly by the CEO in the past.

Speaker 1:

Really? It's not that long ago that it went black.

Speaker 2:

Yeah, Reddit went, quote-unquote, black. A lot of communities closed, because Reddit banned more or less all third-party applications that connect to them. Or, they didn't ban them: they asked a shit-ton of money to connect with their API, basically, which, if you're a simple, kind hobby developer that built a cool tool, you simply can't use it anymore.

Speaker 3:

Yeah, I know. I was playing with fetching data from Reddit in real time to try some streaming stuff, and there were a lot of questions on it. I don't think I'll be able to do this anymore, because I'd need to pay for a license: you can use this, you can do that. So I think there is still...

Speaker 2:

You can use it for free if what you do is free, I think that is the deal.

Speaker 3:

then it's like: do you need to request it, or do you have it by default? But then, if you violate the terms, maybe... Is there, like, a process for me to...?

Speaker 2:

You kind of try to stay under the radar.

Speaker 3:

No, no comment.

Speaker 2:

But let's see what it gives. There have been a lot of shocks to the community, I think, with the third-party thing being a bit of the biggest one. Also, not that long ago, it was in the news that they have a $60 million deal with Google, so that Google can use their content for AI purposes, LLM training purposes, which is a bit iffy, right? I put my stuff on Reddit for free, and then... Apparently, for now, the stock market likes it.

Speaker 3:

But do you think OpenAI even asked, or did they just use the Reddit data?

Speaker 2:

I think we're not allowed to talk about this, but let's see, it's still very early days. So far this year, Reddit is doing very well, but that was the same with Snowflake, and we all know how that ended up.

Speaker 3:

How did that end, bart?

Speaker 2:

It was a bit of a letdown for a time. And Reddit is a company that, if I'm not mistaken, never had a full year where they were profitable. Oh really? If I'm not mistaken, but I can be wrong.

Speaker 1:

But they did, with the downfall of Twitter and the X takeover, take over lots of traffic. True.

Speaker 2:

This is the right time at least. It's the right time.

Speaker 3:

Yeah, yeah. Cool, that was a quick one. I thought you were gonna be more dramatic, Bart.

Speaker 2:

No, I'm gonna leave the drama for another time.

Speaker 3:

Ah man, I like when you're dramatic. I like the drama. I mean, I think, let's do it another time, you know? It's like, you just gotta do it. Cool. Maybe, very quickly, shout-outs? I guess there was a lot of AI stuff that was released these past two weeks.

Speaker 2:

A lot, yeah, yeah, but let's not go into it too much. We have Grok-1 that was released, this was March 17th.

Speaker 3:

It's the Twitter one.

Speaker 2:

The Twitter one. It comes from xAI, so Grok is their conversational bot-ish thing. That was woke, or no, was it woke? It used to be, ugh...

Speaker 3:

It was the anti-woke.

Speaker 2:

What is it, sleeping? It's very... what is the opposite of woke? A combination of racist and sexist. Yeah, yeah, yeah. But it came out.

Speaker 3:

It was not politically... No, it was not politically correct.

Speaker 2:

It has 314 billion parameters, which is significant. It's pretty big. It also means that there are no smaller versions, and it also means that probably no one can run this locally. I think people estimate that you need roughly 320 gigabytes of VRAM, which is significant. I can't run that, maybe. Also interesting is how this came to be, because it did entail a bit of drama. So, Elon has been bashing OpenAI, saying that OpenAI is actually closed AI. He's suing them, right?
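For listeners wondering where the roughly 320 GB figure comes from, a hedged back-of-envelope check, counting only the weights and ignoring activations and framework overhead:

```python
params = 314e9  # Grok-1's published parameter count

# Memory for the weights alone, at two common precisions.
fp16_gb = params * 2 / 1e9   # 2 bytes per parameter (float16)
int8_gb = params * 1 / 1e9   # 1 byte per parameter (8-bit quantized)

print(fp16_gb, int8_gb)  # 628.0 314.0
```

So the ~320 GB estimate is consistent with running the model 8-bit quantized; at float16 you would need roughly double that.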

Speaker 1:

Is that it? He's suing them? Yeah, indeed, yeah, he's suing them. There's an active lawsuit.

Speaker 2:

Well, let's see what comes from it, right? But he's been really bashing OpenAI on this, that they're actually closed and blah, blah, blah, that they were a non-profit and now they're for-profit. And then people on Twitter reacted: yeah, but Elon, X is also not open-sourcing its stuff. And then the reaction to that: oh yeah, we'll open-source next week. So this is why it came to be. Okay, I see.

Speaker 3:

This is why it came to be.

Speaker 2:

So we had Grok, and we had MM1 from Apple. MM1, I don't think there was a major announcement from Apple, but a paper was released on arXiv.org which basically shows that Apple has its first significant LLM model, which took a while.

Speaker 2:

Which took a while. And I think everybody is expecting 2024 to be Apple's year to go big into AI slash LLMs slash an actually working Siri. Let's see where we're at; I think MM1 is where they are today in research and development. There is also talk, very recently, that Apple is potentially collaborating with Google to license Gemini, so that they can leverage Gemini for their services. Okay, which contradicts this a bit, but maybe they see it as a path towards their own solution in the meantime.

Speaker 1:

Yeah.

Speaker 2:

So, interesting to see that Apple is also doing this. Stability AI released Stable Video. Stability AI is also known from Stable Diffusion; this is a video generation model, a bit like OpenAI's Sora. Looks quite cool. I haven't...

Speaker 3:

I don't think it is publicly available yet. Isn't it something like 3D? Yes, yeah, sort of specialized for 3D images, right?

Speaker 2:

And if you look at what they released, it looks quite impressive.

Speaker 3:

They've always been doing really cool stuff, Stability AI, and it's always been open as far as I know. And it's always been open, yeah, yeah. It's really cool stuff.

Speaker 2:

And the last thing here on AI, because there's so much, I'm probably skipping like 80 different announcements from last week: a little bit of insight on GPT-4, that apparently it has more than a trillion parameters, and if I remember correctly, 1.7 trillion parameters, which is crazy to think of. Yeah, so, for example, Grok has, as we just said, 314 billion, which is a big model, probably the biggest open or openly accessible model, and GPT-4 is now rumored to be 1.76 trillion parameters, which is hard to imagine. So maybe just for people, because you hear this a lot: what are parameters?

Speaker 2:

So you've all seen the diagram of a neural network where you have these nodes and connections and layers. The nodes in a single layer are connected to all the nodes in the next layer, and the weights on these connections are basically called your parameters. So this is a bit like: how big is this model, how complex is this model? And this more or less translates to how, quote unquote, smart the model is.
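[Editor's note] The weight counting described above can be sketched in a few lines of Python; the layer sizes below are made up purely for illustration.

```python
# Sketch of the parameter counting described above: in a fully connected
# network, every node in one layer connects to every node in the next, so a
# layer of n nodes feeding a layer of m nodes contributes n * m weights
# (plus m biases). The layer sizes here are invented for illustration.
layer_sizes = [784, 256, 64, 10]

total = sum(
    n_in * n_out + n_out  # weights between the layers, plus biases
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
)
print(total)  # 218058 parameters for this tiny network
```

Scale those layer sizes up by a few orders of magnitude and you get to the billions and trillions being discussed here.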

Speaker 2:

That's what we see in practice: the bigger, the better the performance.

Speaker 3:

Yeah, I think with the transformers, the thing is that they figured out a way that they can just keep increasing it, you can just kind of add stuff, because before there were some issues when you got really big, like there were some limitations, but transformers just kind of scale.

Speaker 2:

And a very recent announcement, though it leaked a few weeks before: a statement from Sam Altman where he confirmed that GPT-5 will be better across all domains. So, curious to see what that means.

Speaker 3:

Yeah, me too. Curious, curious, curious. I mean, again, I feel like we just needed to touch on this because it's very timely. Also, there's so much happening, it's hard to keep up. 20 new announcements next week. Yeah, exactly. So maybe we just need to put the news at the bottom of the screen, like with the news anchors.

Speaker 2:

You know, like a... yeah, like breaking news. Yeah, exactly. All the news that comes in while we're recording, exactly.

Speaker 3:

I think that would be appropriate. So cool. I think we already went over the one-hour mark. I'm not sure we have time to cover all the things here, so maybe we have to keep them for another day. No hot takes today, Bart, sorry. Fine, go ahead, Bart.

Speaker 2:

I'll let you Thank you, thank you. One hot, take one hot take, one hot take, and it's actually.

Speaker 2:

It comes from Murilo, but I've taken the liberty to release this as a hot take. So we were having a discussion earlier this morning, a discussion earlier this week, about this very, very small open source Python library. What's it called again? Expiring... expiring-lru-cache, right? Yes, which is an LRU cache, like in functools, that can also expire, which has 11,000 downloads. But that's not what we're discussing. So I was saying to Murilo, there was an issue. Someone created an issue.
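[Editor's note] For readers wondering what such a package does, here is a minimal sketch of one common way to bolt expiry onto `functools.lru_cache`. This is an illustration only, not the actual expiring-lru-cache implementation: the trick is to salt the cache key with a time bucket, so lookups in a new window miss and get recomputed.

```python
import functools
import time


def expiring_lru_cache(maxsize=128, expires_after=60):
    """Sketch of an LRU cache whose entries expire after a fixed window.

    Hypothetical implementation, not the real package: it salts the cache
    key with the current time bucket, so once the bucket rolls over, every
    lookup misses and the underlying function runs again. Stale buckets
    simply age out of the LRU.
    """
    def decorator(func):
        @functools.lru_cache(maxsize=maxsize)
        def cached(_bucket, *args, **kwargs):
            return func(*args, **kwargs)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            bucket = int(time.monotonic() // expires_after)
            return cached(bucket, *args, **kwargs)

        return wrapper
    return decorator
```

Within one window, repeated calls with the same arguments are cache hits; after `expires_after` seconds the bucket changes and the value is recomputed.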

Speaker 2:

I was saying to him: the package is still on version 0.0.5 after fixing the bug. And then I said, maybe that doesn't really represent it. Like, 0.0.5 seems like I woke up this morning and wrote five lines of code, and it's more than that, right? Yeah, yeah. And then so we upped it, and then Murilo made the statement: so now it's 0.1.0, and that actually means that it's stable. Wait, wait, I'm going to quote the exact words, because maybe I'm misremembering a bit. He said: also, 0.1.0 is usually stable. And I thought to myself, I'm going to take this as a hot take. Sam is here, and I'm sure that Sam also has an opinion on this. So who do we start with to react to this, Murilo or Sam?

Speaker 3:

I don't confirm nor deny any of the information.

Speaker 2:

I can share a Slack screenshot on this.

Speaker 3:

So I guess what I? There was a mishap, you know, like you know, in the labs, in the doing of things. You know my hectic day-to-day life I may have, but you can defend it right, like you don't need to.

Speaker 1:

I just think so, first of all, is this the right version number for your package now.

Speaker 3:

What is the right version Right?

Speaker 2:

Like maybe you can explain, or Sam, like the SemVer X.Y.Z, like...

Speaker 3:

Well, yeah, so I can take a quick crack at this. My laptop's going to die, so let's see how long I can keep this up. But basically, this is SemVer, semantic versioning, which is arguably the most used versioning system, and each number actually means something.

Speaker 2:

Right yeah, I would say so.

Speaker 3:

So there are three numbers here, separated by dots. They indicate major, then minor, then patch. And major, minor, patch mean different things. Major means that you have incompatible API changes. So, for example, if I have a function that says hi, and in the next version the function is called hiU, that's a breaking change, because you need to change your code to keep working with my package, right? A minor version is when you add functionality, a feature. So let's imagine that I have hi and I still have hi, but now it has a parameter, an additional feature so it can say a name, and it has a default value. So basically, if you're using my package, it still works fine, still works the same. But with this new version, there's a new thing that you can change, a new capability of my tool.
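[Editor's note] The `hi` example above, written out as code; the `_v1`/`_v2` names are just for illustration.

```python
# Renaming hi to hiU would be a major (breaking) change. Adding an optional
# parameter with a default, as below, is a minor change, because every
# existing call site keeps working unchanged.

# What version 0.1.0 shipped:
def hi_v1():
    return "hi"

# What version 0.2.0 ships: a new parameter with a default value.
def hi_v2(name=None):
    return "hi" if name is None else f"hi {name}"

assert hi_v2() == hi_v1()  # old usage is unaffected: still "hi"
print(hi_v2("Sam"))        # new capability: "hi Sam"
```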

Speaker 2:

But it works in a backwards compatible manner. So if you would use the old docs, it should work. Exactly.

Speaker 3:

Yeah, exactly. So the promise is: as long as the major version doesn't change, everything should keep working, right? And the patch is basically day-to-day: there's a little bug, something you fix, right? So this is what the patch is. So I agree with that; this is also how everything goes, right? The thing for me is, I've also heard a lot of statements on the 0.0, right? Maybe I took the liberty as well of finding something here from tiangolo, the guy from FastAPI. Where is it? Is it here? Yeah, so, for example... So I have the feeling you prepared your argument on this.

Speaker 1:

Yeah, he didn't scroll down on the page, where it might say that... but I think it's like... so this is...

Speaker 3:

I mean, again, this is a convention, it's a guideline, okay? But like, what is a breaking change? What is not a breaking change? Sure, the idea is clear, but once you get into the details, sometimes it doesn't stay super clear, you know?

Speaker 2:

So, for example, and you have this example. tiangolo is the guy behind, well, I think it's his GitHub username, right? But this is Sebastián Ramírez, the guy behind FastAPI.

Speaker 3:

FastAPI, typer and SQL model. So very popular Python packages. And this is for SQL model. So even he says here test coverage 97%, I will release 0.1 once I have 100%. So that's actually his threshold Once he has 100% coverage, that's when we release 0.1. There are other packages now, for example I think, if you see, I think great expectations you mentioned Sam.

Speaker 1:

So this is your argument? No, no, no. But what does this say about the conversation? So he releases 0.1.0 once test coverage reaches 100%. But what kind of versions are on there right now, then?

Speaker 3:

Does he follow SemVer?

Speaker 1:

Yeah, SemVer would mean that he would now publish versions like 0.1.0-pre or -alpha or something like that.

Speaker 3:

So SQLModel, it's still not at 100%, so it's still on 0.0, for example. He's still at 0.0. And even though he may release features... I'm sorry, he's not following SemVer. He's not following SemVer. Even, like, I think you mentioned... let's see, let's see.

Speaker 3:

Well, Great Expectations is on 0.18. If this was an exam and I wrote that down, I would be wrong. But in real life, I think some people are a bit more free with the way they use SemVer. Also, for example, another argument: I mentioned Poetry, and I have this whole thing on Poetry, because last week we touched on Poetry, so I wanted to come back and correct some things I said that are actually not true. But Poetry, they use the caret, basically saying: I'll accept everything from this major version, but nothing from the breaking version. But since then I've heard a lot of arguments from people in the Python community that you shouldn't use the caret; you should really just use greater-than-or-equals, even if it's a breaking change.
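[Editor's note] Poetry's caret constraint can be sketched as a small function, following the rule Poetry documents: a caret allows changes that do not modify the leftmost non-zero component of the version.

```python
# Sketch of what Poetry's caret (^) constraint expands to. Assumes plain
# major.minor.patch version strings for simplicity.
def caret_bounds(version):
    major, minor, patch = (int(part) for part in version.split("."))
    if major > 0:
        upper = (major + 1, 0, 0)       # ^1.2.3 -> <2.0.0
    elif minor > 0:
        upper = (0, minor + 1, 0)       # ^0.1.0 -> <0.2.0
    else:
        upper = (0, 0, patch + 1)       # ^0.0.3 -> <0.0.4
    return f">={version},<{'.'.join(map(str, upper))}"

print(caret_bounds("1.2.3"))  # >=1.2.3,<2.0.0
print(caret_bounds("0.1.0"))  # >=0.1.0,<0.2.0 -- 0.x minors may break
```

Note how the 0.x case is stricter: because pre-1.0 minors are allowed to break, the caret only admits patches there, which is exactly why the caret-versus-`>=` debate matters for young packages.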

Speaker 3:

Because what is a breaking change for you, for your code, depends on what functions you're using, and by allowing more functions, by allowing more versions, it's much easier for the dependency resolver to work. And again, a breaking change is just me as a developer thinking: this is going to break your workflow, so I bump it up. So it's like, what is a breaking change, what is not a breaking change? It's not black and white a lot of the time, is it? That's exactly it. That's exactly it. So, for example, maybe there is a function I'm using where there's something I thought was a bug, and when I fix it, a lot of people complain because they were relying on this, quote unquote, bug. So then, for you, that's a breaking change.

Speaker 2:

I think, well, the bug... okay, a bug is a bit of an issue, but I think the example with the docs: if you create docs and you lost your... yeah, yeah, but it's fine.

Speaker 1:

You lost your power to your computer.

Speaker 2:

If you use the generated docs, that means for version zero, right, yeah, or you're on version 0.0.x, then you should be able to use what you defined there all the way up to 0.1. I think that is a good definition. If not, there is a breaking change. Yeah, yeah, for sure.

Speaker 3:

But, like you, should be able to use it but.

Speaker 2:

I doubt that that is the case for what we're showing, FastAPI, right? There must have been breaking changes.

Speaker 3:

For our zero.

Speaker 1:

0.0 is a special one as well. It's a very special one.

Speaker 2:

Yeah, yeah, yeah, it's a bit of a.

Speaker 3:

It is a very special one, yeah. But I think with the 0.0, it's like... we talked about SemVer; the 0.0 is a bit different, because maybe 0.0...

Speaker 2:

Like, this is what 0.0.x is in that case, right? But 0.1...

Speaker 3:

But like 0.1, if you look at the SemVer page I was showing before, they say that 1.0 is the stable one, right? So basically, the idea is that for anything that is not yet 1.0, the maintainers are reserving the right to break things at any release, right? So basically you're saying, from 1 to 2 there's a breaking change, but if you're at 0.whatever, from 0.1 to 0.2, you could also have breaking changes. Because the idea is that in software development you want to move fast in the beginning, and you don't want to be constrained by not breaking anything at every point, right, at any bump.

Speaker 3:

So my mistake was to say it was 0.1, when in reality it was 1.0, right? And the idea is that from 1.0 onwards, you know there won't be breaking changes within the same major version.
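[Editor's note] That 0.x rule can be written down as a tiny predicate; a sketch assuming plain major.minor.patch version strings.

```python
# May an upgrade from `old` to `new` contain breaking changes under SemVer?
# Before 1.0.0, any minor bump may break; from 1.0.0 on, only a major bump may.
def may_break(old, new):
    o = tuple(int(part) for part in old.split("."))
    n = tuple(int(part) for part in new.split("."))
    if o[0] == 0:
        return n[:2] != o[:2]   # e.g. 0.1.x -> 0.2.0 may break
    return n[0] != o[0]         # e.g. 1.x.y -> 2.0.0 may break

print(may_break("0.1.0", "0.2.0"))  # True
print(may_break("1.4.0", "1.5.0"))  # False
```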

Speaker 1:

You see lots of packages these days that don't... where the maintainer takes a very long time to publish a 1.0. I don't think that's a good evolution. I fully agree, yeah.

Speaker 3:

I mean.

Speaker 2:

Better to do it more quickly. Yeah, it's the major... it's like they don't want to commit to what they have. I think having a major release creates a bit of trust.

Speaker 3:

Yeah, but I still see, for example, Ruff. Now I think they're on 1.0, or maybe 0.1-something, but even until then it was like 0.0.287. It's like crazy.

Speaker 2:

But maybe, because we need to close this, we can summarize it two ways, right? If you're the creator of Ruff, or if you're the creator of FastAPI, which has 69k stars, you do whatever the fuck you want. In all the other cases, we use the definition of SemVer.

Speaker 3:

And for expiring-lru-cache, is that not a good compromise? So Bart's saying that I need to bump to 1.0? Exactly.

Speaker 1:

So that your users can be confident to use your package.

Speaker 3:

But then I guess, even the first time we published it, we should already have put it at 1.0, because we published it once and we didn't touch it for, like, years. It was already pretty stable. And also, for a project like expiring-lru-cache, are we going to have breaking changes there? Isn't that just stable already?

Speaker 1:

I don't think anyone predicts breaking changes.

Speaker 3:

No, no, but I mean, like, the first release, because we released it at 0.0.1, I guess, right? But we might as well have just released the first version as 1.0.

Speaker 2:

I feel that takes us too far. But we actually added a feature, and now I'm not 100% sure how far back in Python it goes. We dropped support for a number of Python versions, all the old Python versions; that's already a breaking change. And I'm not sure, because this new feature actually exposes something internal; on the old versions, for example, you don't test it anymore.

Speaker 3:

Yeah, anyway, anyways, maybe just to wrap up the versioning as well. I think, like you mentioned, Python does not follow SemVer. Python 3.13 is not SemVer.

Speaker 2:

If you're Guido, you do whatever the fuck you want.

Speaker 3:

And there's also the if you're Dutch right Bart.

Speaker 1:

Let's make it broader. There is Python code that you wrote in 3.0 which doesn't work in Python 3.13, you say?

Speaker 3:

Yeah... no, I think so. I think there are some things that they did.

Speaker 3:

Like there was the whole PEP about removing dead batteries, that they're going to remove some stuff from the standard library. Okay, yeah, yeah. But Python 3.13, like the 3.13 and the 3.14, it's not because there is a new feature; they have a calendar, almost, like every year. And I think, again, if you're using the standard library, the dead-batteries removals are a very edge case, right. I think a Python 4, a very real, truly breaking change, would be if they changed keywords, like the print statement. I think they did that from Python 2 to Python 3: before, print was a statement, print space string.

Speaker 1:

It's also because Python has a history of a difficult upgrade from 2 to 3.

Speaker 3:

Yeah, and that's why, like, Guido said there's never going to be a Python 4 or something.

Speaker 1:

just follow Sam's words, it should.

Speaker 3:

Yeah, but Python doesn't have a patch... I mean, actually they do, they do have a patch version. But the releases are on a calendar, pretty much. And there's also CalVer, right? So you can actually make releases based on the calendar. I think Black used to do that.

Speaker 1:

Or the strange Ubuntu release thing, where they do the .04 and, oh, with a cool name.

Speaker 3:

Yeah, yeah, yeah, maybe we need a cool name.

Speaker 2:

They do the year month, right?

Speaker 1:

No, it's just increasing, but you always have .04. There is a .04, and so I think it's year-based.

Speaker 2:

I don't know... no, I do think I'm correct. So the latest one is 23.10, which is the year, dot 10. I think they typically release in October, .10, and in April, .04.

Speaker 3:

You know Rust? They also committed to never having breaking changes. Just... the latest code name for Ubuntu is Mantic Minotaur.

Speaker 2:

I think that is cool, I think every version, every major version.

Speaker 3:

But I feel like that's a code name, but I feel like that's for operating systems Like Mac also has names, right, that's just lame, I'm limited to that We'll do it, we'll do it.

Speaker 2:

Why.

Speaker 3:

All right, on that, I think that's all the time we have for today. Thanks to whoever stayed with us this whole round. Thanks, everyone.

Speaker 2:

Thanks a lot, Sam, for joining us. Thanks, Sam.

Speaker 3:

Thank you for having me Talk to you later. Talk to you later.

Speaker 1:

You have taste in a way that's meaningful to software people. Oh, this is it.

Speaker 2:

Hey, hey, hello, let's not close. I'm Bill Gates, so that's an intro. I have one more thing I would back name. You don't pause it Okay, oh, wow, okay, okay, we can restart the auto.

Speaker 3:

Yes.

Speaker 2:

Well, it would be really good, but it would be really happy about.

Speaker 3:

I have another, just quick thought.

Speaker 2:

We need the you know the disk they go you know when you change that we need that on the on the soundboard.

Speaker 3:

Carry on.

Speaker 2:

But I would be really happy if we could use a quote from Sam in the intro. So, with a super happy intonation, like: I'm so happy to be at the Data Topics podcast.

Speaker 3:

Oh yes.

Speaker 2:

Maybe we can do this here live. So no, it doesn't have to be like let's fabricate it, let's fabricate.

Speaker 3:

I think that is no, you know no, it's nothing, we can just just say like just repeat after me, let's fabricate it.

Speaker 1:

Let's fabricate it.

Speaker 3:

Yes.

Speaker 2:

You have taste in a way that's meaningful to software people. I'm Bill...

Speaker 3:

Gates.

Speaker 2:

I would. I would recommend TypeScript. Yeah, it right.

Speaker 1:

Include your code for me obsession with bars of soap, and this is well.

Speaker 2:

I'm reminded it today. A bus, yes, but the bus I can't like to say to Congressman iPhone is made by a different company, and so you know you will not learn Russian.

Speaker 1:

Well, I'm sorry guys, I don't know what's going on.

Speaker 3:

Thank you for the opportunity to speak to you today about large neural networks. It's it's really an honor to be here Rust, rust Data topics.

Speaker 1:

Welcome to the data, welcome to the data topics podcast. Let's fabricate it.

Data Topics Unplugged and MVP Award
Understanding DBT and Microsoft MVP
Azure Fabric and Data Solutions Introduction
Decoupling Storage and Compute Data Platforms
Discussion on Database Performance and Projects
The Value of Data Performance Debate
Recent Developments in AI Models
Semver Versioning System Discussion
Data Topics Podcast Conversation