pszemraj

whisper small - big data fall 2022

Oct 9th, 2022 (edited)
https://youtu.be/b7h7O-_o81k

transcription with whisper - small

All right, welcome everybody in the lecture hall in the CBB building, and welcome everybody who joined us on Zoom. So this is the first lecture of the semester for Big Data. It's so nice to see you all here with a full lecture hall; we have been missing that so much in the previous two years. It feels good to be together. I will continue to provide the hybrid option for those of you who want to join on Zoom; this will be possible throughout the semester. We will record the courses. This is also why the green screen is here, because when you watch over YouTube, you have everything on your screen. So yes, we retain the few lessons from what happened in the past two years, and there are a few things that we can keep for the future.

So what are we going to do in this lecture? We are going to learn techniques and skills in order to query amounts of data that can be gigantic: not just gigabytes or terabytes, but even petabytes or exabytes of data. We are going to learn how this is actually done in practice, what the principles are, and what technologies were invented in the past few years. And on top of that, another problem came together with the amount of data being stored: the data is messy. It's a huge mess. It's not like the super nice structured tables we had in the 70s. So we'll speak about that a lot: how to deal with large amounts of messy data. A lot of work was also done on the exercise infrastructure in the past few years; now we have something working with Jupyter notebooks and an Azure setup where you can deal with very large amounts of data yourself. So we will see plenty of super cool things, such as Hadoop MapReduce. Who has already heard of MapReduce, or of Hadoop? Who has never heard of it? So we will learn about that. That's very nice, I see enthusiasm. How about Spark? Who knows Spark? Who has never heard of it? Okay, so these are some of the cool technologies that have emerged, and we will learn the secrets of how they are actually used.

But in order to start the lecture, I always do something different in the first week: instead of just diving into the actual material, I ask why we are doing all of that, and especially in terms of scale, just to give you a feeling for it. When you think of scale, you might also think of the universe and the exploration of the universe, right? So this is not exactly the Big Bang, this is what is happening at CERN, right? But it's not so far from actually reproducing what happened in the Big Bang. If you look at our scale, so one of us, one meter eighty-five, let's say, so one to two meters, this is the order of magnitude, this is the scale we see every day. If we zoom out to the Earth, the circumference of the Earth is 40,000 kilometers; that's a lot. By the way, you might be wondering why it is such a round number. It's actually not by chance: this is how the meter was defined back then, before it was defined directly via the speed of light and the second. So that's actually why it's a round number. We can also convert that, because we have units for it: it's 40 megameters. Then if we continue to zoom out into the solar system, the Sun is around 150 gigameters away from us, so you see another prefix there; all the way out to Jupiter, we are already at about a terameter, and the entire solar system is larger still. And if we zoom out further, this is our galaxy.
By the way, it's incredible how long it took us as humankind, millions of years, to actually figure out that the little things we see up there are just part of what we are in as well, the Milky Way. So this here is 1,000 exameters, which you could also call a zettameter. And then this is zooming out as much as we can: today we are at about 130 yottameters, right? So this is the cosmic background, as far as we can see. It might be that there's even more behind that, but we just don't know, we cannot see it. So this is it. And here I didn't correct for the actual expansion of the universe, because then it would be even bigger. But what I wanted to do with this little exercise is to show you that we have prefixes for very large scales, and that it took us millions of years to reach these scales. And when we look at these scales and go back all the way to the Big Bang, we actually end up studying particles in the infinitely small. This is what the physicists are doing. And in fact, in big data, I would argue, it's not so different. We will spend a lot of the lecture looking at very, very large scales, at huge amounts of data. But we will also spend parts of the lecture on data modeling at very, very small scales. Why? Because data is messy. So we also need to learn how to deal with small amounts of data that can be heterogeneous.

So I like to compare data science with physics in many aspects. Physics is the study of the real world: we run experiments in order to see how the real world behaves. And data science is the same with data: we manipulate data and see what happens.

I have a question for you. We are going to be using, throughout the lecture, something called the clicker app of ETH, the EduApp. The address is right here. You have many ways of doing this: you can go in your browser to this address, eduapp-app1.ethz.ch, or, because I know there are these cool smartphones nowadays, you can also download and install the app called EduApp from whatever download center there is for your operating system. So you can connect right there. And here I need to do something with sharing my screen; hopefully it will still work. So I think I need to go right there, and I think this is what I want to ask you. I would like to just collect a bit of data on your background: are you computer scientists? We might have data scientists as well, maybe a few computational biologists with computer science skills. So this is an opportunity to test the app. I'm giving you enough time so that you can make sure you can use it. We already have 61 answers, 68 now. You can do it from the lecture hall or you can do it from Zoom; it works just as well. Oh yes, let me put this slide up again. Can you see this? It's eduapp-app1.ethz.ch. Let's see, the number seems to be slowly converging. Is anybody having difficulties, or has everybody managed to connect? This is always what happens when you do things live; this is as expected. The first rule is: don't panic. All right, so you see it's working again, and now I'm landing exactly where I was. Okay, and that works. Fabio, you can hear me again? Okay, awesome, so the sound is working. Perfect. So at least now you see that we are live, and that this is not actually pre-recorded. Well, you obviously know, right? But the people on Zoom might actually not know.
All right. So indeed, most of you are computer scientists and data scientists, and that's what we expected, because this is a master's level lecture for computer scientists and data scientists. It also exists because we forked the lecture at some point, because it was becoming very large. So now there is Big Data for Engineers for the other departments, which is offered in spring, so that would be next year, and Information Systems for Engineers, which is the equivalent of the bachelor's level database management lecture. So in case you are not in the computer science or data science program — that should be very few of you — just so you know, there are these two lectures for the other departments, and you are absolutely welcome to attend them. And where we are right now... I just have to lose the habit of showing it directly. So where we are is right here.

Okay. So now I would like to say a few words about how we do science, about the difference between mathematics and physics. Well, you can easily land in a very big philosophical debate here, and I know not everybody agrees with me. But basically, in mathematics, I tend to say, in modal logic terms, that mathematics is necessary, in the sense that we can do it from an armchair: you just sit and think, and you can come up with it. Physics is about the world as it is, and the world could have been different. Maybe the masses of the particles could have been different, maybe the laws of physics could have been different; there are many things that could have been different. And this is called contingency. And for this, we have no other way than experiments: we have to play around with the world, see what happens, and then figure out everything that is contingent. But more recently, we got computer science, which added more to the mix, and in particular what we can do with a machine. We came up with an entire field of science, with theoretical computer science and so on: Turing machines, finite state automata, and so on and so on. So this is computer science. And data science, I see as being the last missing piece, which sits nicely at a sweet spot between physics and computer science. Why? Because like computer science, it is all automatable: you can do it from computer notebooks or with fancy clusters. But it is also just like physics, because why do you do data science? To study the data and understand the world as it is. We could not do it without data; we have to collect data and work on the data. It is in that sense that it is epistemic and contingent, just like physics: we collect data and then learn about the world as it actually is. So this is amazing. But basically the reason we are doing all of that is that we want to learn the actual facts about our own world and how it works. People were actually already wise a long time ago, thousands of years ago: a good decision is based on knowledge, not numbers. So you need to actually manipulate all of these numbers and all of that data in order to make sense of it.

So I would like to go ahead now and continue with a short history of databases, of how we store and manage data. I'll start with the prehistory of databases, and it goes back extremely far, probably even further than you might have expected in a database lecture: thousands of years.
Because thousands of years ago we were already storing and managing data. How? In our brains. People would observe, keep track, and remember what was going on, and then they would spread it to everybody else, to their children and grandchildren; you would also have people singing from city to city in order to tell what people were doing, especially the kings and empires and so on, to make sure that the knowledge spread. But there is a problem with that: information can get distorted and can get lost, because our brains are not fully reliable, and we also need to make sure that it is retained over centuries and even millennia. So this is a problem that needed to be solved, and it was solved thousands of years ago with the invention of writing. Writing was the first time we figured out that we can store data — with information on clay tablets, that's how it was done first — in a way that is preserved through thousands of years. And in fact, we still have some of it today, and this is why we can actually do history. This is what most historians would consider the start of history; before that, we have kind of lost everything, we don't have anything left. But this gives us a way of actually learning about what our ancestors had been doing.

But now I'm going to show you something even more super cool and awesome: this thing here. This is a tablet called Plimpton 322. I don't know what you think, but this looks a lot like some sort of tablet — not a clay tablet, a smart tablet with spreadsheet software installed on it. Already back then, the data was structured as a relational table. This is a relational table; relational tables are thousands of years old. And does anybody know what is stored on this tablet? This is actually information related to the Pythagorean theorem, right? When you have a triangle with a right angle, then three, four, five is an example of possible integer lengths that you could have. Well, this is just a list of some of those that were known at the time. Why did they need that? Well, this is mostly for keeping track of the land, who owns what, so that when there is a flood we are able to reproduce the actual shapes of the plots, right? And this is also used in order to get right angles in construction. So this was stored there so that people would have a database — this is a database — somewhere where these numbers are actually stored. There are a few people who can read all of that and understand what is in there. I think it's in base 60; if you look closer, everything is in base 60 in there.

Is the sound cutting out again on Zoom, did you say? Oh, really? Because I have... Yes, it seems that the Wi-Fi was working fine. But if the issue repeats, we'll definitely switch to LAN. Is it back to normal now, or do you have difficulty hearing me on Zoom? All right, otherwise I'll just continue, and let me know if the difficulty persists and it keeps cutting off. Okay, once in a while. Okay, so let's see if the frequency increases. I'll try to continue and then we'll see; if there is an issue I can maybe also try to reconnect the Wi-Fi or something. All right, and maybe during the break we can try to see if we get a LAN connection working. All right, thank you very much for letting us know. Okay, so this was writing. Then came printing.
What is the problem that printing solves? It solves the problem of making duplicates and copies of the same thing. Because if you want to spread books, for example, you need to replicate them. Back then you needed people to do it manually: they would sit for hours and just make manual copies of everything; typically monks were doing that. With a printing press, with these pre-made characters that you just press onto the paper, it becomes pure duplication. So this was the 15th century. More recently, computers were invented. By the way, there is one in this building, I think on the E floor below, one of the early ones we had in the department, a Cray supercomputer. It's worth the visit if you haven't seen it already. Computers automated things and made them faster, because then we could start processing information, manipulating information, querying information, in order to be even faster. So this was already kind of the beginning of the true revolution. And in the 60s, we had file systems, with directories and files. Back then this is how people would deal with data: they would basically manipulate files on the disk and deal with the files directly, read files and output files. This is how it was done.

And then something happened in 1970. Somebody called Edgar Codd — remember that name — 1970: this was the beginning of database history. The brilliant idea that Edgar Codd had is that people should think of the data on an abstract level, in terms of data shapes and models. Tables are quite intuitive; I showed you a clay tablet that is thousands of years old; everybody understands tables. So that was a natural thing to do: to say that we should isolate the user from everything that is too complicated on the physical level. They shouldn't have to deal with files and directories and so on; we just expose everything as tables with rows and columns, and that's it. The principle behind that is called data independence: we shield the user from what is below. And I'll come back to that in the second unit, probably starting today and then tomorrow, on what was done in the 70s.

And then in the 2000s — and this is where we arrive at big data — this wasn't enough. We'll explain a lot why this wasn't enough. And more technologies came: key-value stores; triple stores, which are graph databases; column stores, which are very, very large and sparse relational tables; and document stores, which basically store collections of trees. So this is what became known as NoSQL — not because we don't like SQL, but "not only SQL", meaning that there is something beyond SQL. All right, so that was a quick history of databases as it has evolved throughout the centuries.
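To make Codd's idea of data independence a bit more concrete, here is a minimal sketch (my own illustration using Python's built-in sqlite3 module, not something from the lecture): you declare a table and query it declaratively, and you never have to deal with the files, pages, or indexes underneath.

```python
import sqlite3

# A tiny relational table, queried declaratively.
# The user never sees how SQLite lays the rows out on disk:
# that is data independence in Codd's sense.
conn = sqlite3.connect(":memory:")  # or a file path; the queries stay identical
conn.execute("CREATE TABLE triples (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [(3, 4, 5), (5, 12, 13), (8, 15, 17)],  # Pythagorean triples, like Plimpton 322
)

# Declarative query: we say *what* we want, not *how* to fetch it.
for row in conn.execute("SELECT a, b, c FROM triples WHERE a*a + b*b = c*c"):
    print(row)
```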
And now a few more words on big data. It's a buzzword, right? And buzzwords are quite hard to define, so I'm going to try: I'm going to tell you what I think big data is. First, what we can probably say is that it involves a lot of proprietary technologies, because this is where it was invented. It's not that I'm doing an advertisement for any of these; I'm just stating as a fact that a lot of these technologies actually came from companies rather than universities, probably because there are a lot more financial means there to build the very large clusters that you need in order to store the data. So this is how it happened. And we are going to look, of course, into what was done, but for many of them we'll use an open source equivalent; Hadoop is an example of an open source equivalent, Spark is actually open source and Databricks is the company behind it, and so on. So we'll see some company names. There are also search engines on the internet now specifically for looking up datasets; that exists as well. But the best approach to big data is to look at it from the perspective of the three V's. The three V's are volume, variety, and velocity. So we are going to discuss them, volume, variety, and velocity, in the next few minutes, and go more into the data.

So first, volume. Why volume? Well, an answer that you hear from a lot of companies is: because we can. A lot of companies, maybe 10 or 15 years ago, started realizing that with so much space available for so cheap, why would we delete any data? Let's just keep it, just in case. That was the mindset back then. Let's keep all the data, we might need it in the future, who knows? Of course, as you know, things have changed. Now we think a lot about data protection, about who has the right to store data; there is the GDPR in Europe and so on. So just because we can doesn't mean it's a good idea. But technologically, we can store and keep all of that data, and that works across all levels: the infrastructure, the data centers, as we will see, the hardware, the software that runs on it, and the technology that was invented on top — I include here machine learning, artificial intelligence, and so on. Another reason is that data has value. Some actually say that data is the new oil: a new resource that we can manipulate, store, and use, and from which we can extract value. So this is also why data is stored. Of course, there are a lot of considerations that are out of scope of this course, because here I'm teaching you the technologies; it doesn't mean that you cannot think critically, of course. Data carries value, and this has an impact: for example, when you have free products offered over the internet, they are collecting your data, and this is a reality to also think of. I recommend, for example, the course Big Data Law and Policy, which is a Science in Perspective course that talks about these kinds of topics from a legal perspective, from an ethical perspective — just because you can do it doesn't mean you should do it. But as I said, in this course we will focus on the actual technologies that allow us to do these sorts of things.

Another aspect is that the utility of a joined data set is higher than just the sum of the two: it's the power of joins. Joins are extremely useful when you want to cross multiple data sets in order to link records with other records, and this has a lot of value. I'll come back to joins in section two, for those of you who might not know them. But joins are also very expensive, and I hope that in this lecture you will understand that a join is something we try to avoid in some ways when we deal with large quantities of data, in particular in terms of complexity, in terms of big O. O of N is what we love in big data; everything that's O of N is the sort of thing we can spread over clusters and so on. As soon as you go above that — maybe N log N is still okay — but as soon as it starts being quadratic or exponential, it's a problem. So linear is what we love, and hopefully you will develop a feeling for what algorithmic complexity means in big data.
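To illustrate why we care so much about staying linear — this is my own sketch, not code from the course — compare a naive nested-loop join, which is O(N·M), with a hash join, which is roughly O(N + M) and is the kind of thing that spreads well over a cluster:

```python
users = [(1, "alice"), (2, "bob"), (3, "carol")]          # (user_id, name)
orders = [(101, 1, 9.99), (102, 3, 25.0), (103, 1, 4.5)]  # (order_id, user_id, amount)

# Nested-loop join: every order is compared with every user -> O(N * M).
nested = [(name, amount)
          for (order_id, ouid, amount) in orders
          for (uid, name) in users
          if uid == ouid]

# Hash join: build a hash table on one side, probe it with the other -> roughly O(N + M).
by_id = {uid: name for (uid, name) in users}   # build phase
hashed = [(by_id[ouid], amount)
          for (order_id, ouid, amount) in orders
          if ouid in by_id]                    # probe phase

assert sorted(nested) == sorted(hashed)
print(hashed)  # [('alice', 9.99), ('carol', 25.0), ('alice', 4.5)]
```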
Another aspect of big data is the collection of complete data sets. It's particularly useful for websites that claim to have a complete data set and offer a complete search, for example to find flights, to find hotel rooms, and so on. So completeness is another aspect of big data, and of course this is intertwined with the fact that we can actually store all of that data. If you think of the data of a social network with billions of users, the list of users actually fits on a laptop. The list of everybody on Earth fits on a laptop; you can compute that it is on the order of gigabytes, maybe terabytes, but it fits. So it is actually a small database if you just have it as a list.

In terms of scales, I've already shown you the prefixes. This is my first assignment to you: if you haven't done so already, I'm asking you to learn these by heart. All these prefixes here: kilo, mega, giga, tera, peta, exa, zetta, yotta. It's just powers of 10, adding three zeros every time. These are standardized international units; everybody agrees on them, there was a standardization body that dealt with that. And as you can see, it was done in batches, not all at the same time: kilo came very early, then mega and giga, and then tera, peta, exa were probably introduced together. Why? Because tetra, penta, hexa is the Greek way of saying four, five, six, and if you count: tera has four groups of three zeros, peta has five, and exa has six groups of three zeros. So tetra, penta, hexa give you tera, peta, exa. Zetta and yotta probably came later; maybe the people who gathered said, okay, let's start from the end of the alphabet, we haven't used z and y yet. So this is how these two were added. And then you might think: is that enough? Well, it's remarkable that this is enough to express the size of the universe. But I think there are already physics papers out there where they basically ran out of prefixes, and you can guess how they continued: they just used w, x, and so on as a way to express even more zeros, but that hasn't been standardized yet. All right, so you must know this by heart. It's super important, because in big data you will need these units all the time, typically with bytes, when you talk about bytes.
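As a minimal sketch of how these prefixes behave in code (my own illustration, not from the course material): formatting a byte count with the decimal, base-1000 prefixes, next to the binary, base-1024 variants that come up in a moment.

```python
SI_PREFIXES = ["", "kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]
IEC_PREFIXES = ["", "kibi", "mebi", "gibi", "tebi", "pebi", "exbi", "zebi", "yobi"]

def human_size(num_bytes: int, base: int = 1000) -> str:
    """Format a byte count with SI (base 1000) or IEC (base 1024) prefixes."""
    prefixes = SI_PREFIXES if base == 1000 else IEC_PREFIXES
    value = float(num_bytes)
    for prefix in prefixes:
        if value < base or prefix == prefixes[-1]:
            return f"{value:.1f} {prefix}bytes"
        value /= base

print(human_size(26 * 10**12))             # 26.0 terabytes  (a 2022 hard drive)
print(human_size(50 * 10**15))             # 50.0 petabytes  (CERN per year)
print(human_size(26 * 10**12, base=1024))  # ~23.6 tebibytes, the same drive in binary units
```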
So I have another question for you; I'll switch again to the other screen. I'd like to ask you what you consider to be big in terms of the amount of data: one gigabyte, one terabyte, one petabyte, one exabyte? By the way, people on Zoom, was the connection good in the past few minutes? Could you hear well? In the worst case, don't worry, because I have many recordings of previous years, so if a recording doesn't go through or there is some problem, I can just reuse recordings from previous years. So don't worry about that; we'll find solutions. All right. So this is what's cool with the first week: we get these large numbers over there. Hopefully you continue to come in the next few weeks; that is actually what I hope. So many of you say petabytes, and I would actually agree with you. I think that petabytes is a good number for big data. Why? Because gigabytes and terabytes fit on a single computer; petabytes is when it no longer fits on one machine and you need to go to a cluster. And exabytes, yes, but this is maybe "huge" rather than "big", because exabytes, I would say, is almost the scale of humanity. At least it was recently; now I suspect that some companies are actually reaching the exabytes. So the scale keeps shifting in terms of what we perceive as big. Okay, so let me come back to the slides. If I have prepared everything correctly, we should land every time exactly where we left off. Okay. So you should actually be impressed by the progress that we've made, because going from a single computer to clusters, and maybe to zettabytes or yottabytes at the scale of humankind, is like going from our own scale to the entire visible universe. And that was done in just a few decades; this is what is incredibly amazing. In just a couple of years — one, two, three years nowadays — we generate as much data as the whole of humankind since the very beginning. In just three years, as much as everything since the very beginning. The exponential growth that we have here in generating all of that data is extremely impressive.

There is another system of units — just checking the time — for those who love powers of two. It was a bit of a mess at some point whether we use powers of two or not. I'm not asking you to learn these by heart, but just so you know that they also exist: if you actually want powers of two, 2 to the 10, 2 to the 20 and so on, there are binary prefixes, and the deviation from the powers of ten gets larger for the bigger ones. I'm not asking you to know that by heart; usually we talk in kilobytes, megabytes and so on. It's simpler.

Okay, now the shape of the data — variety. Data has shapes. Tables you already know; we've known them for thousands of years. But data can also be shaped like a tree. We will see tree-shaped data, because this is what messy data looks like, denormalized, as we will see: XML, JSON, YAML and many other ways of doing that. Even data frames — have you heard of data frames, of Pandas? Yes. So this also fits there: data frames can be seen as collections of trees, collections of valid trees, valid JSON, but we'll come back to that. We have graph databases, on which we will also spend one week at the end of the semester. We have data cubes; we also talk about data cubes at the end. And we have text. Text is covered in a different lecture that we have, Information Retrieval. Who of the students here took Information Retrieval last semester? So we actually dealt with that six months ago in Information Retrieval; this is how we deal with text. But basically, there are five fundamental shapes: tables, trees, cubes, graphs and text. These are the fundamental shapes of data. It is extremely important to understand the shape of your data set, because this is what is going to boost your productivity and the performance of your system. If you don't pick the right shape, it is going to be super slow, both in productivity and in performance.
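To make the tree shape concrete — a small example of my own, not one from the lecture — here is a denormalized JSON document: the related records are nested inside each other instead of being spread over flat tables, and a document store holds, roughly, a large collection of such trees.

```python
import json

# One denormalized, tree-shaped record: the orders live *inside* the user,
# instead of in a separate table linked by a foreign key.
user = {
    "id": 1,
    "name": "alice",
    "orders": [
        {"order_id": 101, "amount": 10.0},
        {"order_id": 103, "amount": 4.5},
    ],
}

print(json.dumps(user, indent=2))

# A document store is, roughly, a large collection of such trees.
collection = [user, {"id": 2, "name": "bob", "orders": []}]
total = sum(o["amount"] for doc in collection for o in doc["orders"])
print(total)  # 14.5
```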
Next, velocity, the third V. We just keep generating data all the time. To give you an impression of why we actually need big data technologies, I'm going to do a quick thought experiment on what has happened over the past few decades. Let's talk about capacity, throughput, and latency. The capacity is the quantity of data you can store, the size of your hard drive: for example one terabyte, eight terabytes. The throughput is the speed at which you can read it: how many bytes per second you can actually read from your storage medium, a hard drive say. And finally, the latency — I think I can actually make this go away; the comments on Zoom should go away. All right, I see that for some people on Zoom it is still cutting out from time to time, so we'll see if we manage to get LAN working during the break. Otherwise, I'll share last year's recording to make sure you have everything; I was saying mostly the same things. Okay, not exactly the same things — I keep updating, I keep making the lecture up to date, of course.

All right. In 1956, this is what we had: the first commercially available hard drive. How big was that? Well, look at the dimensions: it's this big, enormous. And we fit in there an amazing five megabytes of data. Can you imagine? That was enormous. We could read it at 12.5 kilobytes per second, and the latency was 600 milliseconds — of course, you need time to move the reading head in there. This is what one platter looked like; this is what rotates in there, and then you just read from that. So that was the IBM RAMAC 350, 1956. And this is what we have today. Well, you can barely see it, because it's so small that we need to zoom in. This is the newest one I could find, the Ultrastar DC HC670, right there: 26 terabytes of data, a throughput of 250 megabytes per second, a latency of a few milliseconds, and dimensions like this — it fits in the hand.

So now I have another question for you; you have certainly experienced this yourself when copying files. The progress bar — you know how it typically goes with the progress bar: it goes all the way to 99% and then it stays there forever, right? Basically, it's not linear. Why do you think that is? Is it because files have different sizes? Is it because the progress bar doesn't refresh regularly while the transfer is actually happening at a uniform rate? Is it not true, and you're lucky enough that your laptop always shows uniform progress bars everywhere? Or is it because CPU usage may vary? Let's have a peek at your answers. We have two competing answers, right? What I would say... it's not going to close... yeah, there you go. Oh, it actually reset everything. Well, anyway, you saw it, right? It's the fact that the files have different sizes. Here's why: when you have large files, the bottleneck is the throughput — how fast you can actually read the bits from the drive and copy them over. And when you have many small files, it's no longer a throughput issue, it's a latency issue: for every file, the head of the hard drive has to move to a different place. So this is the reason why, if you measure your transfer rate, it's going to be regular for large files, while for small files you will have the feeling that it's super slow, just because you keep jumping all over the disk to fetch the next file. This is the main reason why it behaves that way. And this idea of throughput versus latency is going to keep us busy for weeks. We are going to talk about it for cloud storage, for HDFS, the Hadoop Distributed File System, and so on. It's going to be central in the way we think: throughput versus latency.
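A back-of-the-envelope model of that progress-bar effect (a sketch with assumed but plausible numbers — roughly 4 ms of latency per file and 250 MB/s of throughput — not measurements from the lecture): the total copy time is one latency penalty per file plus the total volume divided by the throughput, so many small files are dominated by latency while one big file is dominated by throughput.

```python
def copy_time_seconds(num_files: int, total_bytes: float,
                      latency_s: float = 0.004, throughput_bps: float = 250e6) -> float:
    """Rough model: pay the seek latency once per file, then stream at full throughput."""
    return num_files * latency_s + total_bytes / throughput_bps

one_gb = 1e9
# Same total volume, very different behavior:
print(copy_time_seconds(1, one_gb))        # ~4 s    -> dominated by throughput
print(copy_time_seconds(100_000, one_gb))  # ~404 s  -> dominated by latency
```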
Okay, let me move over to the slides. So this is the progress made in capacity, this is the progress made in throughput — yes, it's right there, small, but it's right there — and this is the progress made in latency. Of course, latency you want to decrease, not increase. And we are in big trouble, big, big trouble. Because look at what happens — I can show it on a logarithmic scale, but it doesn't make it any better. Imagine a book: the capacity is the number of words, the throughput is the reading speed, the number of words you can read per minute, and the latency is how much time you need to get to the shelf and actually pick up the book — or the time for the robots at ETH that actually fetch the books; you should visit the library by the way, it's very interesting to see the little robots sliding around. Anyway, if you divide this by this, you can compute that it takes you 10 hours to read a book of that size at that speed today. But now go to the future and imagine that books in two centuries have evolved exactly like hard drives did over the previous 70 years — and this has happened for hard drives. Then this is the new size of the book, and this is the new reading speed. And this is the problem: now it takes you something like 11,400 hours. It has massively increased. This is why we cannot just keep going with the same technologies; it just doesn't work, it doesn't scale. We have a big problem: this discrepancy right here.

So what do we do? We parallelize. We are going to parallelize a lot in this lecture. Imagine that you spread the book all over the planet and every human being picks a part of it and reads it; now you can again do it in 10 hours. So you solve that with parallelism, and we are going to do that with clusters of machines and technologies on top of these clusters. And the other discrepancy here, throughput versus latency, is also very important; this we solve with batch processing. Batch processing is the idea that instead of doing things one by one, you group them, in thousands for example, and you do a thousand at a time. This is how you solve this problem here. We'll come back to these two ideas over and over again during the entire semester.

Do we have a question about the connection? Yeah, yeah. No, we really have to deal with that, because there's something we need to solve with the Wi-Fi and the connection. Okay. So this is how I would conclude about big data: it's all these technologies that help us store and analyze data, solving that issue of the discrepancy between capacity, throughput, and latency. This is the problem that we are going to address in this course.

Okay. It's everywhere in the sciences. At CERN, these are enormous numbers of collisions per second. They have 10,000 servers, they have hundreds of thousands of cores, and it keeps growing all over the place. They produce 50 petabytes of data every year, and they even throw a lot away — they don't even record everything — but even what they keep is extremely large. The same goes for astronomy: we try to map the entire sky, every single star or object we can find, and keep track of it. This is also enormous: billions of objects with the Sloan Digital Sky Survey. The last phase ended in 2020 and it is already continuing with the next phase.
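Going back to the book thought experiment for a moment, here it is as a few lines of arithmetic (a sketch: the 120,000-word book, the 200 words per minute, and the rounded growth factors are my own assumptions, chosen only to match the orders of magnitude above). Parallelism divides the reading time by the number of readers, and batching divides the total latency by the batch size.

```python
import math

def sequential_hours(words: float, words_per_minute: float) -> float:
    return words / words_per_minute / 60

def parallel_hours(words: float, words_per_minute: float, readers: int) -> float:
    # Perfect parallelism: every reader takes an equal share of the book.
    return sequential_hours(words, words_per_minute) / readers

def total_latency_s(num_items: int, latency_s: float, batch_size: int = 1) -> float:
    # Batch processing: pay the per-request latency once per batch, not once per item.
    return math.ceil(num_items / batch_size) * latency_s

# A book you can read in 10 hours today (120,000 words at 200 words per minute):
print(sequential_hours(120_000, 200))                                   # 10.0
# Let the book grow like drive capacity (~5,000,000x) while reading speed grows
# only like throughput (~20,000x): sequential reading explodes,
# and only parallelism brings it back down to 10 hours.
print(sequential_hours(120_000 * 5_000_000, 200 * 20_000))              # 2500.0 hours
print(parallel_hours(120_000 * 5_000_000, 200 * 20_000, readers=250))   # 10.0 hours
# Latency: a million requests at 4 ms each, one by one versus in batches of 1000.
print(total_latency_s(1_000_000, 0.004))                    # 4000.0 seconds
print(total_latency_s(1_000_000, 0.004, batch_size=1000))   # 4.0 seconds
```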
So there is a paper, actually, which I will give you as an optional read, that we contributed in our group. It was done in collaboration with physicists, around their data. We tried to understand why people, in order to analyze the collisions of the particles, are not using SQL — because SQL would actually be perfect: it solves all of the data independence problems, it is more convenient, and so on. We tried to analyze why that is the case, and there are interesting insights in there; it's all related to the sorts of things we cover during the semester — their data is dataframe-like, and so on and so on. So if you want to go ahead and read it, of course, you can have a look at that.

DNA is also the kind of data that we analyze, with sequencing: a very large number of base pairs in our bodies. We are also making progress in gene editing and so on. We can even store data on DNA; that was done by some people formerly at ETH. They managed to store data as DNA pairs and execute relational queries directly on the DNA. And this is amazing because it shows, again, data independence: you can do it on your laptop, on a cluster, on DNA, or on a clay tablet — that works too. It's a bit slower, but it works as well.

We are almost at the break. I think we can do the break now, maybe 15 minutes. Then I'm almost done with this introduction part; I'll tell you about the lecture scope and introduce you to the TAs — some of them are here with us. Then we'll move over to a section with a brush-up of SQL and relational databases. Let's take a quick break, 15 minutes. I'll see you at quarter past three for the continuation of the lecture. Thank you very much.