Hyper Prologue Log [CL67]

Posted on Tuesday, Jul 18, 2023 | Series: Chaos Lever
Chris introduces us to LogLog, SuperLogLog, and HyperLogLog. I swear these are real things.


[00:00:00.330] Ned: Some deep sides you got going on there, budy. You doing all right.

[00:00:07.410] Chris: It’s one of those days how do I describe them? The days that end in Why where I don’t actually, like, feel all that great.

[00:00:18.690] Ned: So, yeah, you’re really embracing your 40s.

[00:00:21.650] Chris: Is what you’re stop whining about it. Sure, it’s perfect normal for somebody to pull or twinge or tweak or bust something up in your hip by doing something as outrageous as stepping out of the shower.

[00:00:35.930] Ned: God, I love being old. Yeah, that’s been my experience. Absolutely. Last week, I woke up and my shoulder hurt. Not because I’ve been doing anything strenuous with that shoulder just for funsies, just hurt. I went to bed that night, woke up the next day, it was fine again. It was just like, no, I’m just here for one day. Guest appearance.

[00:01:04.950] Chris: I’ll be back, though.

[00:01:06.860] Ned: That is true. Yeah, they go around on tour once a year or something like that.

[00:01:14.150] Chris: At least most of the time that you injure yourself. It’s in hilarious dramatic fashion, like falling into a hedge. Or falling out of a hedge.

[00:01:24.270] Ned: Hey.

[00:01:24.840] Chris: Or falling near a hedge.

[00:01:28.010] Ned: I do like hedgerows. There’s nothing I yes, it might be a problem. Only if I’m in a Led Zeppelin song. Let’s get started.

[00:01:43.010] Chris: Okay.

[00:01:43.860] Ned: All right. Hello, alleged human, and welcome to the Chaos Lever podcast. My name is Ned, and I’m definitely not a robot. I am not defined by a series of mathematical equations. I am an individual silicon based entity, and I’m totally capable of independent thought, regardless of what that specialist said. I did a push up this week on purpose. I’m smart, not like everyone says. Hey, wait, what? Who wrote this with me? You did. Keep going. Who’s also here. Why do you hurt me so?

[00:02:26.750] Chris: Did I mention I was having a rough week? I needed to lash out at the people that care about me. That’s how these things go, right?

[00:02:34.690] Ned: As is tradition. Absolutely healthy.

[00:02:37.890] Chris: How people are supposed to act in the world.

[00:02:41.170] Ned: Yeah, definitely don’t do any introspection on that or talk to me. No, that’s madness.

[00:02:46.510] Chris: Intro. What?

[00:02:49.510] Ned: Yeah, just like you said. Lash out at the people that care about you and push them away as far as possible. No man is an island except for you. Arm’s length and then an extra 6ft just for good magic.

[00:03:08.910] Chris: Look, I’m just saying, Inspector Gadget had the right idea.

[00:03:15.470] Ned: Putting a helicopter in your hat? Because that’s a good idea and I can get behind that.

[00:03:20.630] Chris: All right. I’m just saying Inspector Gadget had a lot of great ideas.

[00:03:24.710] Ned: You’re not wrong. But let’s talk about some other tech garbage. And you had old thing about a thing that I’ve never heard of. So this is going to be educational for everyone.

[00:03:39.910] Chris: So just to set the stage, this is not what I expected to talk.

[00:03:43.690] Ned: About this week, okay?

[00:03:45.630] Chris: I had a whole thing I was very excited about it. I started researching this thing and then I came across this phrase that just broke my brain. It is hyperlog log, which is just about the greatest name for an algorithm that I think is even possible.

[00:04:07.010] Ned: No notes at all.

[00:04:10.850] Chris: All right, how to begin?

[00:04:15.890] Ned: At the beginning.

[00:04:18.370] Chris: You know about math, right?

[00:04:20.450] Ned: I’ve heard of it.

[00:04:21.990] Chris: That allegedly purist of sciences that underpins and explains literally everything about the physical world in borderline incoherent combinations of symbols, numbers and letters.

[00:04:31.360] Ned: Yes, that’s the one I heard about.

[00:04:33.030] Chris: It also that thing that we try to do when we want to figure out what a reasonable tip is and just end up saying effort, leaving $30 and hoping for the best.

[00:04:43.210] Ned: Are you familiar with bistronomics?

[00:04:46.730] Chris: I’m scared.

[00:04:48.750] Ned: It’s a made up version of math that Douglas Adams created, and it was used to compute how to tunnel through space because the mathematics of calculating the check for a restaurant were so complex that you could use them to literally fold spacetime.

[00:05:11.730] Chris: That seems fair.

[00:05:13.160] Ned: And as always, Douglas Adams was right.

[00:05:17.910] Chris: I mean, I normally just do like, tax times three and then round up.

[00:05:23.190] Ned: Oh, sure, I have a calculator on my phone.

[00:05:30.810] Chris: So it turns out that math is important in computer science.

[00:05:34.970] Ned: No, I know.

[00:05:37.130] Chris: I was also surprised. All right, so what math does, and especially in these kinds of uses, computer science type of circumstances, is take something that we humans love to think that we’re good at, does it better incredibly quickly, and most importantly, at scale. Also, spoiler alert, we humans are kind of terrible at most things.

[00:06:07.750] Ned: Yeah. You ever hear about the people who think they can multitask?

[00:06:12.150] Chris: They’re so cute.

[00:06:13.870] Ned: There’s like an inverse relationship between how good someone thinks they are at multitasking and how good they actually are.

[00:06:22.330] Chris: Well, there’s no such thing as multitasking is the problem there’s? Context switching?

[00:06:28.100] Ned: Yes, context switching.

[00:06:30.910] Chris: You’re a context.

[00:06:33.550] Ned: Only on Friday. Only on Fridays.

[00:06:37.170] Chris: Anyway, back to math. Let’s take a simple example, something that people who have ever taken an Intro to Computer Science is familiar with, the concept of sorting and searching. If you don’t have everything sorted, how can you search it? So there are a lot of complicated answers to those questions, and algorithms were created over literally decades to solve them. Fun things like bubble sort, insertion sort, selection sort have all had their day in the sun. And if you’ve ever taken an Intro to JavaScript class, you’ve definitely written all of the above.

[00:07:22.050] Ned: Indeed.

[00:07:24.750] Chris: So actually, this sent me down a different rabbit hole that I had to cut most of. But there is a fun link in the show notes to a sorting visualizer if this kind of thing is your bag, baby. What I want to talk about, though, is a different problem. Similar idea, different problem. And that problem is how do you count the number of unique visitors to a website? Now, as a question.

[00:07:59.690] Ned: I would ask first, what do you mean by a unique visitor?

[00:08:04.330] Chris: A good question. So counting visitors is easy, right? That’s just a click plus one on a counter every time the page gets loaded. Sure. Visitor has differentiation, right. How do you know that it’s not one person clicking the same page 1 million times, like Derek on a bender, because mostly what he wants to see is the page go blinkety blink.

[00:08:36.150] Ned: It’s true. Yeah.

[00:08:39.670] Chris: It’s an important problem. And it’s not just a problem of website analytics and advertising metrics, even though it kind of totally is a lot of that, which is why unsurprisingly, Facebook and Google have spent a lot of time trying to solve this problem. But there are other practical considerations. This is information you need for say, network analytics. Same problem applies. Is it one person clicking a million times or is it 1 million people who need to get in the front door? Your firewall is absolutely going to want to know the difference between those for reasons. It’s also helpful for knowing who’s using all the bandwidth, where are they coming from, and helping to organize your network and optimize the infrastructure. Because there’s a very different thing between somebody that’s effectively how do you describe a network DDoS operation if you can’t clearly define individuals, right?

[00:09:45.520] Ned: And I mean having that level of insight into what’s happening and also it kind of led to the need for having edge computing or at least distributed caches all across the world to deal with individual visitors who want the data close to them, especially when they’re trying to grab it. So Akamai, for example, has caches everywhere. So when you load something from a website, it’s probably not loading from the original website server, it’s loading from whatever the closest cache is. And you still want to know how many visitors you’re having. But you need to gather that information now, not just from the cluster of web servers that’s running your website know US East one because that’s where everybody deploys everything, but also who’s just hitting the website on those cached endpoints, right, because they’re not actually going to hit the web server and log a visit there.

[00:10:48.670] Chris: And there are plenty of other places where this kind of information is interesting and necessary. Like it’s got value in markets analysis, meaning the stock market and other data lake type mining expeditions. But that’s going to get a little too esoteric, even for me. So let’s just move on.

[00:11:07.430] Ned: Okay, fair enough.

[00:11:09.270] Chris: So back to the basic question. How do you do it? Well, the original way of doing it, which is what our human minds come up with, is basically just keep a list, figure out a way to fingerprint an individual user, keep that person’s information on a list or in a database or what have you. And then every single time a new visit happens, you check every single name on the list. If that person is new, you add them to the list, and now you have plus one on your unique visitors list. Problem? This is not scalable, right? You have to figure out a way to make the individual Identifier as small as possible. So let’s just say you index each user to some kind of a serial number, like a unique Identifier, ten bytes in memory. That allows you a 1 billion unique user space. Problem is, if you do the math, that becomes a list of 10GB in memory.

[00:12:28.830] Ned: Okay?

[00:12:30.450] Chris: Just to check, because it can’t be on disk, right? It would never function fast enough. This is why it becomes very unworkable. If you have a website that’s got 300 unique users, you can do it the old fashioned way, and it will be fast enough that it won’t make a difference. If you have a Google. It doesn’t.

[00:12:55.850] Ned: A Google like a Google plex of users or a Google like you are the company Google.

[00:13:01.530] Chris: Yes to both questions.

[00:13:03.390] Ned: Fair. All right.

[00:13:05.450] Chris: And think about what we were talking about before, all the different areas that this idea is important and it becomes even more unworkable. Think about a firewall that has to perform this task in more than one way. You want unique visitors. What? Per VLAN, per port, per site? You just ran out of memory, right?

[00:13:26.560] Ned: This is kind of what DDoS Distributed Denial of Service attacks rely on, is the fact that you can exhaust the resources of something in the system that’s responsible for tracking individual sessions. So overwhelm that memory bank, or whatever it is, with too many sessions for it to possibly handle, and at some point the whole system just falls over.

[00:13:50.330] Chris: Exactly. So that’s a con. There’s a pro to this approach. And that pro is the certainty you know for sure, down to the individual, all your individual unique users. The downside is that memory disappears. Unless you are spending tens of thousands of dollars and putting gigabytes or terabytes or petabytes of memory into a firewall, it just won’t work.

[00:14:20.210] Ned: Yes.

[00:14:21.190] Chris: And to quote a great philosopher, there’s got to be a better way.

[00:14:25.500] Ned: There’s got to be a better way.

[00:14:31.110] Chris: Underrated in its time. There, I said it.

[00:14:33.260] Ned: Indeed.

[00:14:35.850] Chris: And luckily for us, there was a.

[00:14:38.750] Ned: Better way because math hooray math to the rescue once again.

[00:14:46.490] Chris: So I’m going to spoil the ending here because I think it’s important to note the benefits of Hyperlog log. What it can do is estimate the total number of unique users when talking about sets of about a billion plus with 98% accuracy, taking up 2 memory, which, if you’re keeping score at home, is very different than the 10GB of memory that we were talking about earlier. Okay, so that’s the pro and the con. You sacrifice that perfect certainty, but you gain back effectively 100% of your memory.

[00:15:30.970] Ned: That’s a fair trade off for most scenarios, I think.

[00:15:34.620] Chris: I mean, if that’s not worth calling hyper I just don’t think I know what is. Now here comes the fun part, where we try to explain how this actually happens.

[00:15:47.710] Ned: Okay?

[00:15:49.630] Chris: Unsurprisingly, this problem is predating computer science. In mathematics. It’s described as the Count distinct problem, also known as the cardinality estimation problem. Thank you, Wikipedia. Okay, and it’s simply defined as the problem of finding the number of distinct elements in a data stream with repeated elements. Right, because what we want is uniqueness. If Ned logs in ten times, we only want to check Ned once.

[00:16:20.570] Ned: Right?

[00:16:21.690] Chris: So this all comes out of set theory, which is also how we know that some infinities are bigger than others. And if you really want to melt your brain about that, I suggest the Netflix documentary on the subject, and you can see how circuitous my research was for this article.

[00:16:37.630] Ned: Let’s talk about the empty set, shall we?

[00:16:41.570] Chris: No. Okay, so historically, several algorithms have been proposed to solve the Count distinct problem, each with its own set of pros and cons. Some popular ones include a hashing method, the bit vector approach, the probabilistic counting algorithm, and the log algorithm.

[00:17:01.770] Ned: OOH, log.

[00:17:03.350] Chris: That one sounds familiar, right? We’re going to spend most of our time talking about that. Okay, so here’s how it works. Just like in the other example, what HLL does is turns each individual unique visitor into a unique binary string. And this string in binary ones and zeros.

[00:17:28.180] Ned: Yes.

[00:17:28.960] Chris: You see where I’m going with this?

[00:17:30.870] Ned: I understand that it is binary, so it’s all ones and zeros.

[00:17:35.090] Chris: What we’re creating is a randomness pattern, and that random sequence that has bits 50% chance of being a zero or a one. And this is where the math magic starts to happen, okay? You create this identifier, it’s only ones and zeros. We get back to powers of two pressed to digitation. And by that I mean answering the question of how many unique visitors a site had is exactly the same as answering, how many times did heads come up when I flipped a coin? If that sudden turnabout kind of baked your noodle, you’re not alone.

[00:18:16.450] Ned: Yeah, I’m not sure I’m quite grasping how that works.

[00:18:22.530] Chris: When you flip a coin, you have a 50 50 shot at it coming up either heads or tails, right?

[00:18:29.970] Ned: Yes.

[00:18:30.930] Chris: So logically, we would assume it would just basically go heads, tails, heads, tails, heads, tails.

[00:18:39.170] Ned: But that’s not what happens because it’s a 50% chance each time. So you would expect it to go heads, tails, tails, tails, heads, heads, tails, tails, or something along those lines. It’s not going to be 1010-1010.

[00:18:53.390] Chris: Right? If that was the case, then Roulette would never, ever happen. So what you can do is you figure out the percentage chance of that thing happening. Like, say, for example, what is the chance that heads comes up five times in a row? It’s actually a pretty simple math. It’s one over two times, one over two times, one over two times, one over two times, one over two. And I should have done this math in advance. I think that gets us to like two. 4816, 32, 64, one over 128.

[00:19:28.170] Ned: That sounds right. We’ll go with that.

[00:19:29.510] Chris: Yeah. That’s a unique number. Just like the amount of unique visitors to a website is a unique number. What the HLL is doing is taking the probabilistic chances based on the information given it to us by the unique code for each visitor, averaging that over time to get to a shockingly accurate and very close to 100% unique number of visitors.

[00:20:00.970] Ned: So that just sounds like magic.

[00:20:04.690] Chris: It does.

[00:20:06.100] Ned: How big is the unique binary string that each visitor gets?

[00:20:11.730] Chris: It is either 10, 12, 14 or 16 bits.

[00:20:15.820] Ned: Okay, so I have a 16 bit unique binary string that’s generated when I visit and it just adds that to a much longer string.

[00:20:27.110] Chris: What it does is looks at effectively, it looks at the numbers of zeros and adds that to the counter. So the extreme number of zeros means that you have more unique users because the chances of all those zeros coming up in a row is low. But it happens.

[00:20:46.750] Ned: Okay.

[00:20:49.310] Chris: So you take the numbers, the 16 bit string and you break it into four and we’ll get back to that in a second. But this is how you get rid of randomness and outliers. The general idea is if you see five zeros in a row, the chances are that the amount of unique users has to have hit that percentage. That is the inverse of the flipped coin example.

[00:21:20.810] Ned: Okay, I will agree with you.

[00:21:23.040] Chris: Magic.

[00:21:23.650] Ned: Yes, math magic. That’s all you had to say.

[00:21:27.550] Chris: It’s the exact same problem flipped on its head, no pun intended. Because what we want is that number. We don’t want the percentage.

[00:21:40.010] Ned: Right?

[00:21:40.290] Chris: So when we have the number of zeros in a row, you get the percentage, you flip it into the number and estimating number of unique users. This is also why it happens in such a crazy small amount of memory. Because even as insane as everything I just said sounds, complexity wise, it’s not that bad. This is a function that you can write on a single line rather than some of the more insane ones that are pages and pages long.

[00:22:17.710] Ned: Right?

[00:22:18.620] Chris: So this was the basics of log log, and then what they did was figure out ways to make it more accurate. So the advantage of not having to use all the memory was obvious from the beginning and log log was something like 65% accurate. That’s still damn good when you think about the fact that you’re using effectively 0% of the memory, but it’s not.

[00:22:44.790] Ned: Accurate enough for a lot of use cases.

[00:22:47.850] Chris: Correct. So when we talked about the dividing the sets into four and then using the averages of that, that’s where you got super log lock that’s Ll. That was better. Still not perfect. What the latest version did was take those sets, averaging them together using something called a harmonic mean, hence hyperlog lock.

[00:23:24.290] Ned: Okay.

[00:23:25.490] Chris: One of the main reasons that this type of averaging is necessary is that binary math works on powers of two, right? The rest of the world doesn’t. So this is why it also becomes an approximation, and with hyperlog log, we can get to the best damn approximation it can possibly be.

[00:23:51.290] Ned: Okay?

[00:23:54.330] Chris: Now, astonishingly, there are researchers out there who are still trying to improve on even this. There is a paper out there, I think it’s by Google engineers, that proposes hyperlog log, which I can’t tell if I’m being punked.

[00:24:15.490] Ned: Fair enough. If you see Aston Kutcher hiding behind a couch, you know you’re in trouble.

[00:24:21.410] Chris: There’s also HyperBit bit, which, according to its author quote, the idea is to keep an estimate of IGN, and then for each substream, maintain one bit that marks whether IGN might be increased by one, and another bit that marks whether it should be increased by two. When more than half the substreams argue for increasing the estimate of IGN, the algorithm does so and resets the bits appropriately, unquote. I don’t think I need to explain any more than that.

[00:24:50.180] Ned: No. Got it all. Thank you. That was the easy one. HyperBit bit. I’m there for HBB, let me tell you.

[00:25:01.250] Chris: In researching this, I was once again reminded of, let’s say, my limitations. I understand what this algorithm is trying to accomplish. I read many words. I said some of them out loud today.

[00:25:22.090] Ned: You did very well, I might add. The prestige digitization in particular. I was impressed you got through.

[00:25:29.290] Chris: Thank you for noticing. I still don’t completely get it. I believe it, but, like, what?

[00:25:46.190] Ned: That is my reaction, having not read any of this, and you just introducing it to me, is like, I believe you. I believe that this works because math can be magical. But also, I don’t understand it, and I’m not sure I ever will.

[00:26:03.970] Chris: I don’t think that we need to necessarily no, this actually is the beauty of science and math, is that we don’t need to understand how it works. We just need to know that it works.

[00:26:16.250] Ned: But, yeah, that’s sort of the thing, is whether or not you actually understand how it works, you can see that it works by just testing.

[00:26:24.970] Chris: Right.

[00:26:27.530] Ned: I think there’s a name for oh, it escapes me. Something with an S. I don’t know. We’ll have to look it up on Google or something.

[00:26:38.450] Chris: We’ll probably never figure it out.

[00:26:39.550] Ned: I know, right?

[00:26:40.630] Chris: Surely there’s some kind of method to but we won’t be able to yeah, no.

[00:26:45.970] Ned: Anyway, to relate this back to my days in university, college, whatever you want to call it, I remember taking I want to say it was calculus four, whatever that is. They called it Calc Four. And there was definitely a point in calc four, where I realized that I am no longer understanding what they’re teaching me, and I don’t think I can. Like, I had reached the hard limit of what my brain was capable of doing.

[00:27:22.190] Chris: It’s a humbling moment that comes for all of us.

[00:27:24.980] Ned: It was uncomfortable, to say the least. I don’t know about you, but math was always my strongest subject. All through high school and even through most of college, I was pretty good at math. And then I hit a certain level and I was like, oh, I’m not.

[00:27:45.910] Chris: Something about myself today.

[00:27:48.020] Ned: And it was not pleasant. So I did the natural thing and I ran away and dropped out of college, as one does. I wish this were a joke, but it’s not entirely.

[00:28:00.730] Chris: So you’re saying is that math is what ruined you.

[00:28:04.810] Ned: Let’s say it was a contributing factor. The good news is there’s different kinds of math. And what I discovered is I actually am pretty good at financial engineering math, which can only be used for evil. So I decided not to pursue it.

[00:28:23.570] Chris: Well, you got that going for you, which is good.

[00:28:26.610] Ned: I do. Now, I want to point something out. Throughout this entire discussion, we kept talking about hyperlog and log log and super log, and you did not a single time break into the log song. And for that, I just want to take a moment and admire your restraint.

[00:28:51.930] Chris: It was difficult. There are a lot of people that I need to thank. Not going to, though.

[00:28:59.740] Ned: No, that would ruin the mystery. But, yeah, I appreciate that. And hey, folks, that’ll do it for today’s episode of Chaos Lever. Just as a quick reminder, we are breaking the show up now. The lightning rounds, as we used to call them, are now their own episode, tech News of the Day or Tech News of the Week. And you can find that on Thursdays, so that will appear in your podcatcher of choice very shortly. But that’ll do it for the main historical episode today. Thank you for listening. Or something. I guess you found it worthwhile enough if you made it all the way to the end. So congratulations to you, friend. You accomplished something today. Now you can sit back, pull up the latest episode of The Simpsons on your smart device and marvel about the fact that the cartoon predates the device, the network and the protocol you’re using to consume it. How about that? You can find me or Chris on Twitter at ned 1313 and at heiner, respectively, or follow the show at Chaos underscore Lever if that’s the kind of thing you’re into. Show notes and the sign up for our newsletter are available@chaoslever.com if you like reading things which you shouldn’t.

[00:30:11.470] Ned: We’ll be back in a couple of days to see what fresh hell is upon us. Tata for now.

[00:30:23.330] Chris: What are those things called when they’re little fried bits of potato? Tater tots. That’s what I want tater tots.

[00:30:29.970] Ned: OOH, I like tater tots too. I like it when they’re tachos.

[00:30:36.170] Chris: Now you’re getting fancy.

Show Notes

Hyper Prologue Log

Episode: 67 Published: 7/18/2023

HyperLogLog - Just About The Greatest Name For An Algorithm That I Think Is Even Possible

Intro and outro music by James Bellavance copyright 2022


Chris Hayner

Chris Hayner (He/Him)

Our story starts with a young Chris growing up in the agrarian community of Central New Jersey. Son of an eccentric sheep herder, Chris’ early life was that of toil and misery. When he wasn’t pressing cheese for his father’s failing upscale Fromage emporium, he languished on a meager diet of Dinty Moore and boiled socks. His teenage years introduced new wrinkles in an already beleaguered existence with the arrival of an Atari 2600. While at first it seemed a blessed distraction from milking ornery sheep, Chris fell victim to an obsession with achieving the perfect Pitfall game. Hours spent in the grips of Indiana Jones-esque adventure warped poor Chris’ mind and brought him to the maw of madness. It was at that moment he met our hero, Ned Bellavance, who shepherded him along a path of freedom out of his feverish, vine-filled hellscape. To this day Chris is haunted by visions of alligator jaws snapping shut, but with the help of Ned, he freed himself from the confines of Atari obsession to become a somewhat productive member of society. You can find Chris at coin operated laundromats, lecturing ironing boards for being itinerant. And as the cohost on the Chaos Lever podcast.

Ned Bellavance

Ned Bellavance (He/Him)

Ned is an industry veteran with piercing blue eyes, an indomitable spirit, and the thick hair of someone half his age. He is the founder and sole employee of the ludicrously successful Ned in the Cloud LLC, which has rocked the tech world with its meteoric rise in power and prestige. You can find Ned and his company at the most lavish and exclusive tech events, or at least in theory you could, since you wouldn’t actually be allowed into such hallowed circles. When Ned isn’t sailing on his 500 ft. yacht with Sir Richard Branson or volunteering at a local youth steeplechase charity, you can find him doing charity work of another kind, cohosting the Chaos Lever podcast with Chris Hayner. Really, he’s doing Chris a huge favor by even showing up. You should feel grateful Chris. Oaths of fealty, acts of contrition, and tokens of appreciation may be sent via carrier pigeon to his palatial estate on the Isle of Man.