whisper medium - big data fall 2022

https://youtu.be/b7h7O-_o81k

transcription with whisper - medium

All right, welcome everybody in the lecture hall in the CBB building. Welcome everybody who joined us on Zoom. So this is the first lecture of the semester for Big Data. It's so nice to see you all here with a full lecture hall. We have been missing that so much in the previous two years. It feels good to be together. I will continue to provide the hybrid option for those of you who want to join on Zoom. This will be throughout the semester. You can do this. We will record the courses. This is also why the green screen is here, because when you watch on YouTube, then you have everything on your screen. So yes, we retain a few lessons from what happened in the past two years, and there are a few things that we can keep for the future. So what are we going to do in this lecture? We are going to learn techniques and skills in order to query amounts of data that can be gigantic: not just gigabytes or terabytes, but even petabytes or even exabytes of data. So we are going to learn how this is actually done in practice, what the principles are, what technologies were invented in the past few years. And on top of that, we have another problem that came together with the amount of data that is stored: the data is messy. It's a huge mess. It's not like the super nice structured tables we had in the 70s. So we'll speak about that a lot: how to deal with large amounts of messy data. And one of the projects on which a lot was done in the past few years: now we have something working in Jupyter notebooks. And you have a lot of Azure, probably in December or something, where you can deal with very large amounts of data yourself with this. So we'll see plenty of super cool things such as Hadoop, MapReduce. Who already heard of that? Who never heard of that? So we will learn about that. That's very nice. I see enthusiasm. How about Spark? Who knows Spark? Okay, who doesn't know Spark, never heard of it?
Okay, so these are some of the cool technologies that have emerged, and we will learn the secrets of how they are actually used. But in order to start the lecture, I always do something different in the first week instead of just diving into the actual material: why we are doing all of that, especially in terms of scale, just to give you a feeling. When you think of scale, you might also think of the universe and the exploration of the universe, right? So this is not exactly the Big Bang. This is what's happening at CERN, right, but it's not so far away from actually reproducing what happened in the Big Bang. But if you look at our scale, so, one of us: one meter 85, let's say, so one to two meters. This is the order of magnitude. This is the scale we see every day. Right. So we zoom out to the Earth. The circumference of the Earth is 40,000 kilometers. That's a lot. By the way, you might be wondering why that is a round number. It's actually not by chance that it's a round number: this is how we actually defined the meter back then, before it was defined directly with the speed of light and the second. So that's actually a round number. And we can also convert that, because we have units there. So we can also consider that it's 40 megameters. Then we continue to zoom out into the solar system. So to the Sun: 150 gigameters. So you see another prefix there. All the way to Jupiter, we are already at a terameter. The entire solar system: one petameter. So here with the zoom out, this is our galaxy. By the way, it's incredible: it took us as humankind millions of years to actually figure out that the little things we see there are actually just what we are in as well. So, the Milky Way. So this is here a thousand exameters, which you can also call a zettameter. And then this is zooming out as much as we can do it. And then we are at 138 yottameters. Right.
So this is the background noise as we can see it. It might be that there's even more behind that, but we just don't know, we cannot see it. Right. So this is it. And here I didn't correct for the actual expansion of the universe, because it would actually be even bigger if you consider that it expanded. And what we wanted to do with this little exercise is to show you that we have prefixes and there are very large scales, and it took us millions of years in order to reach these scales. And when we look at these scales and go back all the way to the Big Bang, we actually end up studying particles in the infinitely small. This is what the physicists are actually doing. And in big data, I would argue, it's not so much different. We will spend a lot of the lecture looking at very, very large scales, at huge amounts of data. But we will also spend parts of the lecture on data modeling at a very, very small scale. Why? Because data is messy. So we also need to learn how to deal with small amounts of data that can be heterogeneous. And we can compare data science with physics in many aspects: just like physics is the study of the real world, where we run experiments in order to see how the real world is behaving, data science is the same with data: we just manipulate data and see what happens. And then I have questions for you. So we are going to be using throughout the lecture something called the clicker app of ETH. It's right here. This is the address. We have many, many ways of doing that. You can go to your browser at this address right here. And then we have the touch. Or, because I know that you have new cool smartphones, you can also download and install the app called EduApp, for free, from whatever download center there is for your operating system. Right. And then connect right there.
And here I need to do something with sharing my screen. Hopefully it will still... So I think I need to go right there. And I think that this is what I want to ask you. I would like to just collect a bit of data on your background. So, are you computer scientists? We might have data scientists as well, maybe a few computational biologists with computer science skills. So here that's an opportunity to test the app. You know, I'm giving you enough time so that you can make sure that you can use it. We already have 61, and now 68. And you can do it from the lecture hall or you can do it from Zoom. It works just as well. Oh yes, let me put the slide again. Can you see this? You have the EduApp: eduapp-app1.ethz.ch. Let's see, it seems that the number is slowly converging. Is anybody having difficulties, or hasn't managed to connect yet? You know what happens when you do things live. This is as expected. So the first rule is: don't panic. So you see, it's working again. And now I'm landing exactly where I was. Okay. And that works. Fabio, you can hear me again? And you can see the results being very well done. And that's the effect. So at least now you see that we are live and that this is not actually prerecorded. Well, you obviously know, right, but the people on Zoom might actually not know. All right. So indeed most of you are computer scientists and data scientists, because this is a master's level lecture for computer scientists and data scientists. And that's the point, because it was becoming very large. So now we have Big Data for Engineers for other departments, which is offered in spring. So that would be next year. And Information Systems for Engineers, which is the equivalent of the bachelor's level database management lecture, right, where you learn relational databases and SQL.
But again, for those of you who could not register in the system because you are not in the computer science and data science programs, that should be very few of you, hopefully, just so you know that there are these two lectures for you, for the other departments, and you're absolutely welcome to attend these. And where we are right now... I just have to lose the habit of showing directly. So where we are is right here. Okay. So now I would like to say a few words about, you know, how we do science. For a very long time we've been doing mathematics and physics. What's the difference between mathematics and physics? Well, you can easily land in a very big philosophical debate. And everybody agrees with me. But basically, I tend to say, in modal logic terms, that mathematics is necessary, in the sense that we can do it from an armchair, right, just sit and think. You can come up with it. And physics, this is called contingency, and for this we have no other way than through experiments: we have to just play around with the world and see what happens, and then figure out everything that is contingent. And then we got computer science, which added more to the mix, and in particular this is what we can do with the machine, and we also came up with an entire field of science with theoretical computer science and so on: Turing machines, automata and so on and so on. And this is computer science. And data science I see as being this last missing piece that is nicely at a sweet spot between physics and computer science. Why? Because data science is to study the data and understand the world as it is. We could not do it without data: we have to collect data and work on the data. It is in that sense that it is epistemic and contingent, just like physics: we collect data and then learn about the world as it actually is. So this is amazing. And the reason we are doing that is that we want to learn about the actual facts about our own world and how it works.
So people were actually already wise a long time ago, thousands of years ago: a good decision is based on knowledge, not numbers. So you need to actually manipulate all of these numbers and all of that data in order to make sense of it. So I would like to go ahead now and continue with the history, a short history of databases: how we store and manage data. I start with the prehistory of databases, and actually it's extremely long ago, probably even more than you might have expected in a database lecture: thousands of years ago. We were already storing data and managing data. How? In our brains. It's people who would just observe and, you know, keep track and remember what's going on, and then they would just spread it to everybody else, so to their children, grandchildren, but also you would have people singing from city to city in order to explain what people are doing, especially the kings and empires and so on, you know, to make sure that the knowledge is spread. But there is a problem with that: information can get distorted and can get lost, because our brains are not fully reliable, and we also need to make sure that it is retained over centuries and even millennia, right? So this is why we needed to actually solve a problem, and the problem was solved thousands of years ago with the invention of writing. Writing was the first time that we actually figured out that we can store data, information. And we have tablets, that's how it was done first, in a way that is preserved through thousands of years, and in fact we still have some of it today. And this is why we can actually do history: this is what most historians would consider the start of history. Before that we kind of lost everything, we don't have anything left, but this here gives us a way of actually learning about what our ancestors had been doing. And I'm going to show you something even more super cool and awesome: this thing here. So this is a tablet called Plimpton 322.
I don't know what you think, but this looks a lot like some sort of tablet, not a clay tablet, a smart tablet with a spreadsheet software installed on it. And the data was structured as a relational table. This is a relational table: relational tables are thousands of years old. Does anybody know what there is in there, what is stored on this tablet? This is actually information on, you know, Pythagoras' theorem, right, when you have a triangle with a right angle, and then 3, 4, 5 is an example of possible integer lengths that you could have. Well, this is just a list of some of those that were known at the time. And why did they need that? Well, this is mostly for keeping track of the actual land, who owns what: when there is a flood, then we need to be able to reproduce the actual shapes of the land, right? And this is also used in order to get right angles in construction. And so this was starting there, that people would have a database. This is a database, somewhere where these numbers are actually stored. And if you look at all of that and understand what there is in there, I think it's in base 60, I should look closer, everything is in base 60 in there. Everybody is saying the same... and me too, did I say? Oh really, because I have... It seems that the Wi-Fi was working fine, but if there is an issue and it repeats, we will definitely switch to LAN. Is it back to normal now, or do you have difficulty hearing me on Zoom? All right, otherwise I'll just continue, and let me know if the difficulty is persisting. The difficulty is that it cuts off, okay. Oh, once in a while, okay. So let's see if the frequency increases. I'll try to continue, and then we'll see if there is any issue; I can maybe also try to reconnect the Wi-Fi or something. All right, and maybe during the break we can try to see if we get a LAN connection working. All right, thank you very much for letting us know.
Okay, so this is writing. Then came printing. What is the problem that printing solves? It solves the problem of making duplicates and copies of the same thing. Because if you want to spread books, for example, you need to replicate them. Back then you needed people to do it manually: they would sit for hours and just make manual copies of everything; typically monks were doing that. With the printing press, there was a way, with these, you know, pre-made characters that you just press on the paper. So this was the 16th century. More recently, computers were invented. By the way, there is one in this building, I think on the E floor below, that's one of the early ones we had in the department, a Cray supercomputer. It's worth the visit if you haven't seen it. But computers automated and made things faster, because then we could start processing information, we could start manipulating information, we could start querying information in order to be even faster. So this was already kind of the beginning of the true revolution. And in the sixties we had file systems, like directories and files. Back then this is how people would deal with data: they would basically manipulate files on the disk and directly deal with the files, read files and output files. So this is how it was done, and then something happened in 1970. Somebody called Edgar Codd, remember that name. 1970, this was the beginning of database history, where the brilliant idea that Edgar Codd had is that people should think of the data on an abstract level, in terms of data shapes and models. So tables are quite intuitive: I showed you a tablet that is thousands of years old, everybody understands tables. So that was a natural thing to do, to say: now we should isolate the user from everything that's too complicated on the physical level. They shouldn't have to deal with files and directories and syntax and so on, and we'll just expose everything as tables with rows and columns, and that's it.
And the principle behind that was called data independence: we shield the user from what is below. And I'll come back to that in the second unit, probably starting today and then tomorrow, on what has been done in the seventies. And then in the two thousands, this is where we arrived at big data: this wasn't enough. We will explain a lot why this wasn't enough, and more technologies came. So key-value stores are one of them; triple stores, with graph databases; column stores, which are very, very large and sparse relational tables; and document stores, which basically store collections of trees, right? So this is what was known as NoSQL, not because we don't like SQL, but "not only SQL", meaning that there is something beyond SQL. Right, so that was a quick history of databases as it has evolved throughout the centuries, and now I'll say some more words on big data. So it's a buzzword, right, and with buzzwords it's quite hard to define, so I'm going to try, I'm going to tell you what I think big data is. So first, what we probably can say is that it involves a lot of proprietary technologies, because this is where it was invented. It's not that I'm doing any advertisement for any of these, right, I'm not advertising, I'm just saying as a fact that a lot of these technologies actually came from companies rather than universities, probably because there are a lot more financial means in order to have very large clusters that you can use in order to store data, right? So this is how it happens. And we are going to look, of course, into what was done, but for many of them we will use an open source equivalent: Hadoop is an example of an open source equivalent; Spark is actually open source, Databricks would be the company behind it. And so, all right, so we'll see some company names.
There is also a way on the internet now: there are search engines for actually specifically looking for data, that also exists now. But the best approach to big data is to look at it from the perspective of the three Vs. The three Vs are volume, variety and velocity. So we are going to discuss them, volume, variety and velocity, in the next few minutes and go more into details. So first volume, then we'll do variety and velocity. Why volume? Well, an answer that you hear from a lot of companies is: because we can. So a lot of companies, especially maybe, let's say, 10 or 15 years ago, started realizing that with so much space available for so cheap, why would we delete any data? Let's just keep it just in case. That was the mindset back then, right: let's keep all the data, we might need it in the future, who knows. So of course, as you know, things have changed now. Now we think a lot about data protection, about who has the right to store data, there is the GDPR in Europe and so on, right? So just because we can, it doesn't need to be a good idea, right? But technologically we can store and keep all of that data, and that works across all levels, right: the infrastructure, the data centers, as we will see, the hardware, the software that runs on it, and the technology that was invented; I include here maybe machine learning, artificial intelligence and so on. Another reason is that data has value. Some people actually say that data is the new oil, just a new resource that now we can manipulate and store and use, and we can extract value from it, right? So this is also why data is...
Of course there are a lot of considerations that are out of scope of this course, because here I'm teaching you the technologies. It doesn't mean that you cannot think critically. Of course, the fact that data carries value has an impact, for example when you have free products offered over the internet that are collecting your data, right, and this is a reality to also think of. I recommend for example the course Big Data, Law and Policy, which is a D-GESS Science in Perspective course, a course that talks about this kind of topics from a legal perspective, from an ethical perspective. You know, just because you can do it doesn't mean you should do it, right? But as I said, in this course we will focus on the actual technologies that allow us to do these sorts of things. Another aspect is that the utility of a joined dataset is higher than just the sum of the two: it's the power of the join. Joins are extremely useful. I'll come back to that, for those of you who might not know joins, in Section 2. Joins are super useful when you want to actually cross multiple datasets in order to link records with other records, and this has a lot of value. It's also very expensive; I hope that in this lecture you will understand that. A join is something we try to avoid in some way when we deal with large quantities of data, in particular in terms of complexity, in terms of big O. O(n) is what we love in big data: everything that's O(n), that's the sort of thing we can spread on clusters and so on. And if you're above that, maybe n log n, kind of, but as soon as it starts being quadratic or exponential, oh my. Right, so linear is what we love, and hopefully you will develop a feeling of what the algorithmic complexity means in big data.
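To make the remark about joins and complexity a bit more concrete, here is a minimal Python sketch that is not taken from the lecture. It joins two tiny lists of records in two ways: a nested-loop join, which compares every pair and is therefore quadratic, and a hash join, which indexes one side in a dictionary and is linear in the input sizes on average, plus the size of the output. The record layout and the names `users`, `orders`, `nested_loop_join` and `hash_join` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy datasets: (user_id, name) and (order_id, user_id, amount).
users = [(1, "Ada"), (2, "Edgar"), (3, "Grace")]
orders = [(100, 1, 25.0), (101, 3, 12.5), (102, 1, 7.0)]


def nested_loop_join(left, right):
    """Compare every pair of records: O(len(left) * len(right))."""
    result = []
    for user_id, name in left:
        for order_id, order_user_id, amount in right:
            if user_id == order_user_id:
                result.append((name, order_id, amount))
    return result


def hash_join(left, right):
    """Index one side by key, then scan the other side once.

    On average O(len(left) + len(right)), plus the size of the output.
    """
    index = defaultdict(list)
    for user_id, name in left:
        index[user_id].append(name)
    result = []
    for order_id, order_user_id, amount in right:
        # .get avoids creating empty entries for unmatched keys.
        for name in index.get(order_user_id, []):
            result.append((name, order_id, amount))
    return result


if __name__ == "__main__":
    # Both strategies return the same rows, possibly in a different order.
    assert sorted(nested_loop_join(users, orders)) == sorted(hash_join(users, orders))
    print(sorted(hash_join(users, orders)))
```

With three users and three orders the difference is invisible, but with a billion records on each side the nested loop would need on the order of 10^18 comparisons, which is why the linear variant is the kind of thing that can be spread over a cluster.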
Another aspect of data is the collection of complete datasets. It's particularly useful for the websites that claim to have a complete dataset and have a complete search, for example to find hotel rooms and so on and so on. So completeness is also another aspect of databases, and of course this is intertwined with the fact that we can actually store all of that data. If you think of the data like a social network with billions of users: it fits on a laptop. The list of everybody on Earth fits on a laptop. You can compute it in terms of, you know, gigabytes, maybe terabytes, but it fits. So this is actually a small database if you just have it as a list, right? So it fits. In terms of scale, I've already explained to you the prefixes. This is my first assignment to you: if you haven't done so already, I'm asking you to learn this by heart, all these prefixes here: kilo, mega, giga, tera, peta, exa. It's just the powers of 10, but adding three zeros every time. These are standardized international units, right, everybody agrees on them; there is a standardization body that deals with that. As you can see, it's been done in batches, right, not all at the same time, because you see that kilo was very early, mega, giga, and then tera, peta, exa probably were invented at the same time. Why? Because tetra, penta, hexa, this is the Greek way of saying four, five, six. And if you count here, there are four, tetra: 1, 2, 3, 4 groups of three zeros; 1, 2, 3, 4, 5, penta; and six groups of three zeros here, hexa. That's how to remember it, right? So tetra, penta, hexa give you tera, peta, exa. And zetta and yotta probably came later; maybe the new people who gathered said: okay, let's start with the end of the alphabet, we didn't use Z and Y yet, right.
So this is how these two were added. And then you might think: is that enough? Well, it's remarkable that this is enough to define, like, the length of the universe. This is enough, but I think there's already a physics paper out there where they basically ran out of zeros. So you can guess how they continued: they basically continued with W, X and so on. They didn't give any name to that, but they just used W and X as the way to express even more zeros. But this hasn't been standardized yet. All right, so you must know this by heart. Super important, because in big data you will need these units all the time, but typically with bytes, right, when you talk about bytes. So I have another question for you. I'll switch again to the other screen, which is to ask you what you consider to be big in terms of the amount of data: one terabyte, one petabyte, one exabyte? By the way, people on Zoom, was the connection good in the past minutes, could you hear? So I'm not seeing anything back... yes, it's okay. Okay, awesome. And in the worst case, you know, don't worry, because I have many, many recordings of previous years, so in the worst case that the recording doesn't go through, or whatever problem there is, I can just reuse some recordings of previous years. So don't worry about that, we will just have solutions. So this is what's cool with the first week, that we get these large numbers over there. Hopefully you continue to come in the next few weeks, this is actually what I hope. All right, so many of you say petabytes, and I would actually agree with you. I think that petabytes is a good number for big data. Why? I would say because gigabytes and terabytes fit on a single computer; petabytes is when it's no longer enough, you need to go to a cluster.
And exabytes, yes, but this is maybe huge rather than big, because exabytes, I would say, is almost the scale of humanity, at least it was recently, but now I suspect that some companies actually are reaching the exabyte scale. So this is already, you know, the scale keeps shifting in terms of what we perceive as big, right? Okay, so let me come back to the slides. If I have prepared everything correctly, we should every time land exactly where we left. Okay, so you see, you should be impressed actually here by the progress that we've made, because going from, you know, a single computer to clusters and so on, and maybe zettabytes or yottabytes at the scale of humankind, it's actually going from our scale to the entire visible universe, right? And that was done in just a few decades, this is what's incredibly amazing. Just a couple of years, one, two, three years: nowadays we generate as much data as the entire humankind since the very beginning in just three years. As much as everything since the very beginning. This is extremely impressive, the exponential growth that we have here in generating all of that data. So we have a large system of units. Just checking the time. For those who love powers of two: it was a bit of a mess at some point, whether we use powers of two or not. I'm not asking you to learn that by heart, but just so you know that also exists, right, if you want to actually use two to the power of 10, 20 and so on. There's a higher deviation maybe for the larger ones, but that exists. I'm not asking you to know that by heart. Usually we talk in kilobytes, megabytes and so on, it's simple. Okay, now the shape of the data.
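As a small aid for the prefixes and the remark on powers of two, here is a Python sketch that is an illustration only and not part of the lecture. It lists the decimal prefixes as powers of 1000 next to the corresponding powers of two, and computes how far apart the two are. The names kibi, mebi and so on are the IEC names for the binary units; the helper function names are made up.

```python
# Decimal (SI) prefixes: each one adds three zeros, i.e. a factor of 1000.
SI_PREFIXES = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]
# Binary (IEC) prefixes: each one is a factor of 2**10 = 1024.
IEC_PREFIXES = ["kibi", "mebi", "gibi", "tebi", "pebi", "exbi", "zebi", "yobi"]


def decimal_value(rank):
    """Value of the rank-th decimal prefix (rank 1 = kilo)."""
    return 1000 ** rank


def binary_value(rank):
    """Value of the rank-th binary prefix (rank 1 = kibi)."""
    return 2 ** (10 * rank)


def deviation_percent(rank):
    """How much larger the binary unit is than the decimal one, in percent."""
    return (binary_value(rank) / decimal_value(rank) - 1) * 100


if __name__ == "__main__":
    for rank, (si, iec) in enumerate(zip(SI_PREFIXES, IEC_PREFIXES), start=1):
        print(f"{si:>5} = 10^{3 * rank:<2}  {iec:>4} = 2^{10 * rank:<2}  "
              f"deviation {deviation_percent(rank):.1f}%")
```

The deviation grows from about 2.4% at kilo to about 21% at yotta, which is consistent with the remark that it is higher for the larger units.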
Data has shapes. Tables you already know, we've known them for actually thousands of years, but data can also be shaped like a tree. We will see tree-shaped data, because this is what messy data looks like, denormalized, as we will see: this is XML, JSON, YAML and many other ways of doing that. Even data frames: have you heard of data frames? Yes, so this also fits there. Data frames can also be seen as collections of trees, we will see. These are collections of valid trees, valid JSON, but we will come back to that. We have graph databases, which we will also spend time on when we come to the end of the semester. We have data cubes, we also talk about data cubes at the end. And we have text. Text is in a different lecture, that's in information retrieval. Do we have students here who took information retrieval last semester? So we actually dealt with that six months ago in information retrieval, this is how we deal with text. But basically there are five fundamental shapes: tables, trees, cubes, graphs and text. These are the fundamental shapes of data. It is extremely important to understand the shape of your datasets, because this is what is going to boost your productivity and the performance of your system. If you don't pick the right shape, this is going to be super slow, both in productivity and performance. Next, velocity, the third V. So we just keep generating data all the time, right? Just to give you an impression of why we actually do big data, I'm going to do a quick thought experiment on what has happened over the past few decades. Let's talk about capacity, throughput and latency. The capacity: you can see the size of your hard drive, for example terabytes, eight terabytes. Then the throughput is the speed at which you can read it: how many bytes per second can you actually read from your storage medium, hard drive nowadays. And finally the latency. I can actually make this go away, the comments of Zoom should go away. All right, so I see that for some people on Zoom it's still
cutting from time to time, right, so we will see if we manage LAN during the break, and otherwise I'll share last year's recording, to make sure that you have it. I was saying mostly the same things. Okay, well, I keep updating, it's not the same things, I keep making the lecture up to date, of course. All right, in 1956, this is what we had: that was the first commercially available hard drive. How big was that? If we look at the dimensions, it's like, you know, this big, right, enormous. And we fit in there an amazing five megabytes of data. Can you imagine? This is enormous. We can read it at 12.5 kilobytes, characters, per second, and the latency was 600 milliseconds, because of course you need time in order to move the reading head in there, right? Okay, this is what one level looked like, so this is what rotates in there, and then you just read on that. So that was the IBM RAMAC 350 in 1956. This is what we have today. Well, you can barely see it because it's so small that we need to actually zoom in. It's the biggest that I could find: 26 TB, the Ultrastar DC HC670, it's right there. 26 terabytes of data, the throughput 250 megabytes per second, the latency a few milliseconds, and the dimensions like this: it fits in the hand. So now I have another question for you. I'm sure you actually experienced that when you copy over files: the progress bar. Typically you know how it goes with the progress bar: it goes all the way to 99% and then it's there forever, right? But basically it's not linear. Why do you think that is? Is it because files have different sizes? Because the progress bar doesn't refresh regularly and the transfer is actually happening at a uniform rate? Is it not true, you're lucky enough that your laptop always shows uniform progress bars everywhere? Or is it because CPU usage may vary? Let's have a peek at your answers. We have two competing answers, right. What I think... I would say... it's not going to close... there you go, it actually reset everything. Well, anyway, you saw it, right, so it's
the fact that the files have different sizes. Here's why: when you have large files, the bottleneck is in the throughput, how fast can you actually read the bits from the drive or copy them over. And when you have many small files, it's no longer a throughput issue, it's a latency issue: at every file the head of the hard drive has to move to a different place. So this is the reason why, if you measure the throughput rate, it's going to be regular for large files, and for small files you will have the feeling that it's super slow, just because you keep jumping all the time all over the disk to fetch the next file. So this is the main reason why it behaves in that way, and this idea of throughput versus latency is going to keep us busy for weeks. We are going to talk about that for cloud storage, for HDFS, the Hadoop Distributed File System, and so on and so on. It's going to be central in the way we think: the throughput versus the latency. Okay, let me move over to the slides. I think it's the fourth, right. Okay, so this is the progress made in capacity. This is the progress made in throughput. Yes, yes, it's right there, small, but it's right there. And this is the progress made in latency. Of course, latency you want to decrease, not increase. We are in big trouble, big, big trouble, because of what happens. I can show it in logarithmic scale, but it doesn't make it any better. Imagine with a book that the capacity is the number of words, the throughput is the speed, the number of words we can read per minute, and the latency is how much time you need to get to the shelf and actually pick the book, right, or the robots at ETH that actually get the books. You should visit the library by the way, it's very interesting to see the little robots walking around, well, not walking but sliding around. Anyway, if you divide this by this, you can compute that it takes you 10 hours to read a book today, of that size and with that speed. But now if you go to the future and imagine that the books in
two centuries have evolved exactly like hard drives in the previous 70 years, right? If this has happened, then this is the new size of the book in two centuries, and this is the new speed. This is the problem: now it takes you 11,400 [inaudible]; it has massively increased. This is why we cannot just keep going with the same technologies. It just doesn't work, it doesn't scale. We have a big problem, right? So this is why we have this discrepancy right here. So what do we do? We parallelize. We are going to parallelize a lot in this lecture. So imagine now you spread it all over the planet: every human being picks one and reads it. Now you can do it in 10 hours, right? So you solve that with parallelism. We are going to be doing that with clusters of machines, and technologies on top of these clusters. All right, and the other one here. So of course there is a discrepancy right here that we solve with parallelism. This one right here, the throughput versus latency, is also very important. This we solve with batch processing. Batch processing is the idea that instead of doing it one by one, you group them in thousands, for example, and you do it a thousand at a time. This is how you solve this problem here. But we'll come back over and over and over again during the entire semester to these two things. Do we have a question about the connection? Yeah, yeah, no, we really have to deal with that, because there's something we need to solve with the Wi-Fi and the connection. Okay, so this is what I conclude about big data: it's all these technologies that help us to analyze data, solving that issue of the discrepancy between capacity, throughput, and latency. This is the problem that we are going to solve in this course. Okay, so it's everywhere in the sciences. At CERN, these are enormous numbers for the number of collisions per second. They have 10,000 servers, they have hundreds of thousands of cores, and it keeps growing all over the place. They produce 50 petabytes of data
every year. They even throw a lot away; they don't even record everything, right? But even what they keep is extremely large. The same goes for astronomy. We try to map the entire sky, every single star or object that we can find, and keep track of that. This is also enormous, right? Billions of objects, with the Sloan Digital Sky Survey. The last phase ended in 2020, and now it's already continuing with the next phase, right? So there is a paper, actually, I will give it to you as an optional read, that we actually contributed to in our group. That was in collaboration with physicists, actually, around CERN data, and we tried to understand why people, in order to analyze the collisions of the particles, are not using SQL. Because SQL is actually perfect: it solves all of the data independence problems, it makes it more convenient, and so on. We tried to analyze why that is the case, and there are interesting insights in there. It's all related to the sort of things we are going to cover during the semester: that their data is partly nested, data frame like, and so on and so on. But if you want to go ahead and read it, of course, you can have a look at that. Okay, DNA is also the kind of data, with sequencing, that we analyze. So there is a very large number of pairs in our body. You know, we are also making progress in gene editing and so on and so on. We can even store data on DNA. That was also done by some people formerly at ETH: they managed to store data as DNA pairs and execute relational queries on that, directly on DNA. And this is amazing because it shows, again, data independence: you can do it on your laptop, on a cluster, or on DNA, or on a clay tablet, that works too. It's slower, but it works as well. All right, so we are almost at the break. I think we can do the break, maybe 15 minutes, and then I'm almost done with this introduction part. Then I'll tell you about the lecture scope, I'll introduce you to the TA team, some of them are here with us, and then we'll move
over to a section about, you know, a brush-up of SQL and relational databases, right? So let's take a quick break, 15 minutes, and I'll see you at quarter past three for the continuation of the lecture. Thank you very much.
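
[Editor's note, not part of the recording] The throughput versus latency argument from this part of the lecture can be checked with a little arithmetic. The following is a minimal Python sketch, not something shown in the lecture. It uses the drive figures quoted above, and it makes some assumptions: decimal units (1 MB = 10**6 bytes), "a few milliseconds" taken as 5 ms, and a simple cost model of one seek per file plus a sequential read. The names (full_scan_seconds, copy_seconds, the example file sizes) are made up for illustration. It does not try to reproduce the numbers of the book analogy on the slides.

```python
# Drive figures as quoted in the lecture. Decimal units are an assumption.
RAMAC_1956 = {"capacity": 5e6, "throughput": 12.5e3, "latency": 0.6}
# "A few milliseconds" is taken as 5 ms here (assumption).
DRIVE_2022 = {"capacity": 26e12, "throughput": 250e6, "latency": 0.005}


def full_scan_seconds(drive):
    """Time to read the whole drive once: capacity divided by throughput."""
    return drive["capacity"] / drive["throughput"]


def copy_seconds(drive, file_sizes):
    """Simplified model: one seek (latency) per file, then a sequential read."""
    return sum(drive["latency"] + size / drive["throughput"] for size in file_sizes)


# Two hypothetical workloads with the same total size of 1 GB.
one_large = [1e9]               # a single 1 GB file
many_small = [1e4] * 100_000    # 100,000 files of 10 kB each

if __name__ == "__main__":
    print("full scan 1956 (s):", full_scan_seconds(RAMAC_1956))
    print("full scan 2022 (h):", full_scan_seconds(DRIVE_2022) / 3600)
    print("copy one large file (s):", copy_seconds(DRIVE_2022, one_large))
    print("copy many small files (s):", copy_seconds(DRIVE_2022, many_small))
```

Under these assumptions, reading the full 1956 drive takes minutes while reading the full 2022 drive takes more than a day, which is the capacity versus throughput discrepancy that parallelism addresses. The same gigabyte copied as many small files is dominated by the per-file latency rather than by the throughput, which is the effect behind the uneven progress bar and the motivation for batch processing.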