The Anatomy of AI: The Tech Behind the Intelligence [CL99]

Posted on Thursday, Mar 28, 2024 | Series: Chaos Lever
AI’s infrastructure, from the coding languages that power the models, to the networking that connects it all. Plus, a look into what effects this has on our planet.

Transcript

Ned: Hello? Hello? Can anyone, can anyone hear me? I’ve, I’ve fallen down a hole. It’s a very, it’s a very deep hole. It was dug by the Ultra Ethernet Consortium and I can’t get out. Can someone, can someone throw me like a Cat6 cable or something? That would, all right, well, I guess I won’t be able to record a new episode until I, until I get out of this hole.

So while you’re waiting, I, I guess you could listen to this other episode that Chris and I did all about the infrastructure of AI. It’s going to be really pertinent to this ultra Ethernet thing. If I, if I can ever get out of here,

I’ll see you next week.

So you, you wanted this earlier, huh? You finally, uh, deciding that eating at a normal time is a good idea?

Chris: I mean, I had cookies for breakfast and peanut butter and jelly for dinner. So eating normal, really not my cup of tea. See what I did there?

Ned: Wow, you did well there. Cookies for breakfast is not necessarily bad.

If I look at the sugar content of any given cereal, cookies might actually have less sugar. So out of control. It really is. I thought for a little while there, I was like, I’ll be healthy. And I’ll eat multi grain Cheerios. No, you didn’t. I, okay, well.

Chris: Oh, it’s a bit, right. Sorry, go ahead.

Ned: Ah, you jerk. So, I thought, I’ll try to eat a little healthier in terms of breakfast cereal.

I won’t have like the Fruit Loops. I’ll go with, say, multi grain Cheerios. It’s Cheerios. It’s probably healthy, right? Turns out, Fruit Loops and multi grain Cheerios have the same amount of sugar in them. Nice. Yeah. So there’s that.

Chris: Which is, I mean, I had the same problem when I got my, uh, my family pack of marmalade and molasses bits. Very surprising. Very surprising.

Ned: That’s a different kind of M&M, but still delicious. So my dog, who sleeps in the room with us because she has severe, um, abandonment issues, she decided that she would chase some sort of fluffy animal in her sleep last night. Many times. Those are fun. Yeah. So, I was woken up by these little muffled barks. Like, there was definitely a bleed over from her dream into reality, and I was getting the short end of the stick on that one.

Thanks, dog. Or she gets to sleep all day and I do not. So I think she’s fine with it.

Chris: You are seriously not doing well in this arrangement. It doesn’t sound like...

Ned: Yeah, it’s not necessarily a symbiotic relationship. You know, my daughter asked me yesterday what a parasite was. I didn’t have a good answer.

She also asked me at the dinner table, and my wife was like, don’t, let’s, let’s not get into details. Anyway, that has nothing to do with what we’ll be talking about, or does it? Because we’re going to talk about how to AI. Is that like?

Chris: Hmm, I think there’s a couple of stretches we can make.

Ned: I might not have to stretch that far.

Oh, facehugger is going to come up, and it has nothing to do with the movie Alien.

Chris: It’s also not called Facehugger.

Ned: It’s called Hugging Face, whatever. AI has been in the headlines almost constantly for the last eternity, also known as 18 months. Despite all that fanfare, the ChatGPT-ing, and the faces being hugged, I have to admit that I still don’t know how it all works.

Not that anyone really does. I mean, for me at least, beyond the prompt text that I submit to ChatGPT or DALL-E, I have no idea what’s happening in the background. Like, what’s the software stack that’s powering it? What type of operating system is it running on? Orchestrator, applications, etc. What is behind the massive training models for something like GPT-3.5 or 4?

And what about all the hardware that runs it? Like, I know there’s a lot of GPUs, because I’ve been told that many, many times, and NVIDIA is worth 110 trillion, but beyond that, I don’t really know. So I decided to dig through some of these layers and try to trace down from the prompt to the physical servers, and y’all get to come with me.

So get in, loser. We’re gonna learn how to AI.

Chris: Can I do the radio?

Ned: The driver owns the radio. I thought we established this. No one wants to listen to your...

Chris: Do you have one Mamas and the Papas CD? One. It’s been 30 years.

Ned: And still not long enough, oddly enough. 30 years ago, we didn’t have CD players in cars, which was another big problem.

Chris: Yeah, you had to do the, oh man, do you remember you had to do the, the tape that had the cord that came out of it, and you put it into the tape player, and plug it in your Walkman, or your Discman, I’m sorry, your Discman.

Ned: Your Discman, which had the skip protection.

Chris: Which never worked or killed the batteries in 90 minutes. Those were your choices.

Ned: One of the two. Yeah, I had that full setup. I had the Velcro on the dashboard that would hold the Discman in place because, you know, it wasn’t built in or anything. Yeah, if you go to Tech

Chris: Babies love it.

Ned: I think we both watch Technology Connections, the YouTube channel. He has a whole thing on that little tape adapter and how it actually worked. So, you know, if you’re interested, definitely check that out. I’ll try to remember to put that in the newsletter. Hey, we have a newsletter. Hey, all right. But AI, back to the thing that we’re talking about.

That’s super important. So I have to admit, it was kind of hard to figure out where to start because AI is not a simple or small concept to deal with. I had a couple ideas. NVIDIA seemed to have a few white papers on their hardware. That seemed promising. OpenAI, I hoped they’d publish something about their actual hardware and software stack.

Turns out, nothing recent. Probably because, you know, it’s proprietary or something. Hugging Face is a popular open source alternative. I bet they have a getting started guide, which they kind of do. And I also know a couple people who might not mind answering stupid questions. It was not as many as I hoped.

Chris: It was just a dog, wasn’t it?

Ned: It was just the dog. She had some, some input, or really some output. It was not helpful. But I did do a lot of reading and I’ve assembled what I think is a decent overview of the software and hardware that is part of training and using AI. So, let’s start with the model training portion of things.

According to a 2016 post, which is, if I’m doing my math, a million years ago, OpenAI was using a combination of Python, TensorFlow, Numpy, or NumPy, Keras, and Anaconda. Some of these terms I’ve heard before, but aside from Python, I don’t have any kind of handle on what they do. Spoiler, it’s all Python.

Chris: It’s Python all the way down.

Ned: Almost. Almost. Python is, for those who haven’t used it, a general purpose programming language. I’ve had the good fortune to use it here and there. It’s important to note that Python isn’t a compiled language. It’s not like Go, where you write a program and then you compile it. You tell Python to execute your Python scripts, and it does that.

Because it’s pretty user friendly and straightforward to use, it’s been extremely popular in the machine learning and data science community for quite a while. If you’ve ever heard of Jupyter Notebooks that are spelled J U P Y T E R, they’re spelled that way because Python. And Pandas is another Python project.

Chris: For the casuals in the audience, can I just, you know, I just want to take this moment to sympathize, and all of us at the same time.

Ned: There’s gonna be a lot of new

Chris: Yeah, no, no, I just, I kind of wanted, I wanted to, I wanted to put it all together into one and just get it out of the way early because we have to spell things dumb, apparently.

Ned: This is not the last instance of that. So yes, prepare yourself. So that’s Python. Python, it’s a programming language. You may or may not have heard of it and may have used it. TensorFlow is a machine learning library with APIs available for both Python and C++. TensorFlow includes support for CUDA enabled GPUs on Linux or Windows.

We’ll get to what CUDA is in a bit. You write your model code in Python. And you use the TensorFlow package and the functions that are contained inside that package.

Chris: Right. And for a long time, that was where the magic happened. And it still sort of is, but that’s the most important one from a software perspective, right?

Ned: Yes, that and one other package, which is NumPy, which is spelled Numpy, N U M P Y. And once I typed it, I was like, oh, P Y is in the name, so this is probably a Python thing and in fact it is. It’s a Python package developed to support scientific computing. It has support for matrix math, which if you know anything about AI is a fundamental component of AI and machine learning.

Being able to do linear algebra and matrix math.
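For anyone wondering what "matrix math" looks like in practice, here is a minimal pure-Python sketch of the multiplication at the heart of a single neural network layer. The function name, the input values, and the weights are all made up for illustration; a real project would hand this to NumPy (where the same operation is written `inputs @ weights`) rather than loop by hand.

```python
# Pure-Python sketch of the matrix math behind one neural-network layer.
# Illustrative only: real code would use NumPy or a framework instead of loops.

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both lists of lists."""
    k = len(b)
    assert all(len(row) == k for row in a), "inner dimensions must match"
    return [
        [sum(a_row[i] * b[i][j] for i in range(k)) for j in range(len(b[0]))]
        for a_row in a
    ]

# Hypothetical numbers: 2 samples with 3 features each, feeding a layer with 2 outputs.
inputs = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]
weights = [[0.5, -1.0],
           [0.0, 2.0],
           [1.0, 0.5]]

outputs = matmul(inputs, weights)
print(outputs)
```

Each output number is a weighted sum of one sample's features. Training a model amounts to doing this billions of times while nudging the weights, which is why hardware that is good at exactly this operation matters so much later in the episode.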

Chris: And this is the part of the show where Ned’s going to take 15 minutes off the cuff and explain matrix math. All right.

Ned: Okay, so you start with the matrices. I am not going to do that. I took linear algebra in college and then immediately forgot all of it because Why wouldn’t I?

The other one is, uh, Keras. I think that might be how it’s pronounced. There’s no P or Y in it, so I was surprised to find out that it’s, it’s Python adjacent. Keras is a framework that’s built on TensorFlow, but it’s supposed to streamline the whole machine learning development process. So it’s delivered as a package in Python.

Well, starting to think that maybe learning Python is slightly important if you want to get involved in this, but essentially it’s there to simplify your use of TensorFlow by introducing some higher level functions and abstractions. Lastly, we have Anaconda, which is both a company and a product. The product is basically a ton of Python packages that have to do with data science, machine learning, and AI, all packaged together and placed inside a nice and friendly development environment, or IDE.

So you would probably develop your machine learning model in Anaconda using all the packages that we’ve already covered.

Chris: So far.

Ned: Okay. So that is the software you’re going to use to build your model. And when you’re building it and testing it, you’re probably going to run it locally or on a system with one GPU card.

But at a certain point, you’re going to need to scale up. So what are AI folks doing to orchestrate runs? Because it’s fairly straightforward if you just have one box and you’re like, go train on this model for the next 24 hours. But what if you want to distribute that workload across a whole bunch of nodes, and you want something to keep the run going, restart it when necessary, all those kinds of things?

How do you orchestrate that and schedule it? For starters, many of the projects I saw are all using containers. And there’s no real surprise here. With all the libraries and dependencies and particular versions of Python and TensorFlow flying around, putting that all in a container ensures that the needed versions are all contained in that box and you have a consistent experience regardless of which system you’re deploying it on.

So now we have to schedule containers. And, you know, when I say scheduling containers, what’s the first word that comes to mind?

Chris: Boo!

Ned: Yes!

Chris: Oh, I meant Kubernetes.

Ned: Oh, yes.

Chris: But I repeat myself.

Ned: Boo! Uh, so Kubernetes, it might seem like the obvious solution, and in fact, back in 2016, that’s exactly what OpenAI was using on top of AWS.

And that’s when they wrote the paper. And then, once they started getting a ton of funding, they sort of clammed up and they aren’t really talking about their internal stack as much now, but from what I understand, it’s still broadly using Kubernetes. Since then, there have been some specialty orchestrators that have sprung up, and they better understand the challenges behind scheduling what is going to be thousands of containers across ten or maybe a hundred nodes.

In particular, placement must take into account the hardware topology, including things like NUMA, intra-node placement, and GPU connectedness. That’ll make more sense when we get to the hardware section. So I came across two different vendors that are in this space. There were a ton more, but time is a finite thing.
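To make "topology-aware placement" a little more concrete, here is a toy sketch of the idea. It is not how Kubernetes or any of the specialty orchestrators actually schedule work; the function, the node names, and the GPU counts are all hypothetical. The only rule it encodes is that a job's GPUs should sit in one node when possible, because traffic inside a box is cheaper than traffic between boxes.

```python
# Toy placement sketch: prefer to keep a job's GPUs on one node.
# Node names and GPU counts are made up for illustration.

def place_job(free_gpus, gpus_needed):
    """Return {node: gpus_assigned}, or None if the cluster can't fit the job.

    free_gpus maps node name -> number of idle GPUs on that node.
    """
    # Best case: the smallest single node that can hold the whole job,
    # so all GPU-to-GPU traffic stays inside one box.
    fits = [n for n, free in free_gpus.items() if free >= gpus_needed]
    if fits:
        node = min(fits, key=lambda n: (free_gpus[n], n))
        return {node: gpus_needed}

    # Otherwise spill across nodes, taking the nodes with the most idle GPUs
    # first so the job touches as few nodes as possible.
    if sum(free_gpus.values()) < gpus_needed:
        return None
    placement, remaining = {}, gpus_needed
    for node in sorted(free_gpus, key=lambda n: (-free_gpus[n], n)):
        if remaining == 0:
            break
        take = min(free_gpus[node], remaining)
        if take:
            placement[node] = take
            remaining -= take
    return placement

cluster = {"node-a": 8, "node-b": 4, "node-c": 2}
print(place_job(cluster, 4))    # fits on a single node
print(place_job(cluster, 10))   # has to span nodes
print(place_job(cluster, 20))   # does not fit at all
```

A real scheduler also has to weigh NUMA layout, which GPUs share a fast link, and jobs that are already running, which is where the specialty tools claim their advantage.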

Run.ai created their own orchestrator, which they claim does a better job than Kubernetes. I wasn’t able to verify that, but that’s their claim. It must be true. It’s on the internet. And then there’s a graduated project from the Linux Foundation called Flyte. And of course it’s not spelled like it sounds.

It’s F L Y T E. And why they couldn’t spell it with a PH, I don’t know. Like, just go all the way guys. Flyte leverages Kubernetes, but adds its own scheduler and orchestration layer. So you deploy Flyte using a Helm chart onto your Kubernetes cluster, and then it actually takes over the scheduling of AI jobs for Kubernetes.

Says, I know what I’m doing. You go away now. So that’s the orchestration layer. From an operating system perspective, it appears to be Linux all the way down. Yes, you can do local development on a Windows box. But as soon as you move off the local box, it’s going to be Linux. Let’s be honest. Special mention of Ubuntu in a lot of the documentation I read.

Red Hat was also called out as a viable option. You could also sort of bring your own flavor of Linux, whatever it is, but Ubuntu was the one that people seemed to like the most. Maybe that’s because of the relationship Microsoft has with them. I don’t, I don’t really know. Virtualization is an option, but it seems like a lot of organizations prefer to go with bare metal as opposed to adding the abstraction layer that virtualization introduces, especially when it comes to virtualizing physical GPUs.

Although, VMware with their latest version of vSphere, I guess that’s vSphere 8, has some pretty bold claims around their ability to virtualize GPUs efficiently and they have a whole private AI program that they’re pushing along with partners like HPE, Dell, and NVIDIA.

Chris: Yeah, we always talk about It doesn’t make any sense to run anything bare metal.

And then there’s always that little tiny asterisk in the corner, right? This is one of those times where that asterisk comes up because virtualizing a GPU has historically been a huge pain in the ass. Because most of the time when you’re virtualizing enterprise hardware, you don’t really care about the graphics card.

Ned: Typically not.

Chris: So your options up until now, even with VMware, have been a straight pass through. Pinning a GPU to a virtual machine, which defeats the purpose of virtualization in the first place. So yeah, just install bare metal. It’s easier. Stop it.

Ned: I had an interesting conversation with the folks over at VMware.

And if you’ll remember back in like the Kubernetes conversation where VMware basically said that you can run virtual machines for your nodes in VMware and have them be more efficient than physical nodes because of the deep integration that AMD and Intel have added to their chipsets. To a certain degree, VMware was actually more efficient than the native operating system or the bare metal operating systems you would lay down a Linux or a Windows.

They had some numbers to back that up. I don’t think the same is the case with GPUs yet, but NVIDIA has every incentive to add all the virtualization stuff into those GPUs to help support these types of workloads.

Chris: Yeah, it sort of defeats the purpose of the architecture of a GPU chip, and it’s kind of out of the scope of this conversation, but what you just said is definitely interesting in the sense that I think it’s an ongoing effort and something to keep an eye on.

Ned: Yes, and we’re not going to get into it. There is a whole component of the GPUs from NVIDIA that is all about multi tenant workloads and splitting up what’s actually on the card across more than one tenant for hyperscalers. I didn’t want to cover that because that’s just a whole other layer of complexity.

But, um, hey, since we’re talking about the hardware already, let’s delve in and talk about NVIDIA. The current king of the hill when it comes to AI hardware is the H100 GPU from NVIDIA. Now, that’s not a single card, but it’s actually a family of cards that come in different form factors depending on where you want to place them, but they’re all part of what they’re calling the Hopper generation of NVIDIA GPU architecture.

They’ve been using famous programmers throughout history to name their generations, so the previous generation was one after Dennis Hopper. Yes, as we all know, in addition to being a fantastic thespian, he’s also a well regarded programmer. Nobody questioned that. Don’t look it up. Anyway, so this one is the Hopper generation.

The previous one, I believe, was the Lovelace generation. We’ll get to the interface between software and hardware, also known as the SDK and CUDA, in a moment, but first I’m going to dig into what a typical GPU setup does through the lens of the H100, since that’s the current newest version of the GPU from NVIDIA.

This architecture is similar in all the other ones, just not as robust, let’s say. So for starters, we have the GPU or graphics processing unit itself, which is composed of graphics processing clusters, texture processing clusters, streaming multiprocessors, L2 cache, and high bandwidth memory. All of those components have corresponding initialisms, which I find incredibly confusing, if you weren’t already confused enough by what I just said.

So I’m gonna just use their full names for clarity.

Chris: I can’t believe you don’t want to just throw around phrases like GPC, TPC, SMC, and L2C. Like just, what’s the problem?

Ned: Can we talk about the overuse of initialisms and acronyms in IT? Like, can we all just like chill with this bullshit? Because I’m trying to read the documentation and it is nigh unreadable until I incorporate all of these initialisms into my brain.

Which is just, it’s overhead I don’t need.

Chris: IDK man. BRB.

Ned: Okay, I’m just, I’m just gonna leave it there because I’m too frustrated.

Chris: Would you go so far as to say that you’re FUBAR?

Ned: I would not. Though I didn’t know that was an initial, uh, an acronym for many, many years.

Chris: That’s because it’s naughty. It’s got a naughty word.

Ned: Yeah. Bar. Yeah, that’s a terrible word. As far as I can tell from NVIDIA’s documentation, it goes like this.

The Graphics Processing Cluster contains the Texture Processing Clusters. And those contain the Streaming Multiprocessors. And inside of those, we have the CUDA Cores and the Tensor Cores. So if you’re imagining like a package inside a package inside of a package, that’s sort of how it goes. So when they list out the specs for a given GPU, they talk about how many graphics processing clusters it has, and then the number of texture processing clusters inside that, etc, etc.

So when you’re trying to pick apart the documentation, that’s the hierarchy. Then there’s memory on the card to serve all these different components. That’s the high bandwidth memory, and that’s presented through a set of memory controllers. And then there’s L2 cache, which is directly accessible by the graphics processing clusters.

You can get information into and out of the GPU either through a PCI Express 5 host interface or through NVLink ports that are connected to a separate high speed hub. We’re going to come back to NVLink at some point, but just know that it’s a lot faster than PCI Express 5. All of this is a lot to try to hold in your head, so I will provide a link to the docs if you want to try and wade through the details.

There’s also a very helpful image that has the sort of breakdown of the components. But allow me to hit some highlights on the numbers. So one of the form factors for the H100 is called the SXM5 form factor, because why not? That form factor has 16,000 CUDA cores on it, 528 tensor cores, 80 gigabytes of Gen 3 high bandwidth memory, 50 megabytes of L2 cache, and NVLink Gen 4 with 16 ports.

I believe that’s a single card. In a single system. Neat. Yeah. Did I say 16, 000 CUDA cores? I think I might have. That’s

Chris: a lot of cores. Yeah, I was gonna say, how many do you think you actually need? Probably like 10? Like

Ned: 12 maybe, a dozen. A baker’s dozen, 13, why not? NVIDIA further expands that into the DGX H100 server line, which has a truly astonishing amount of everything.

For starters, the DGX server has 8 H100 cards in it. So 8 of those cards. Do the math, I’m not going to. It also has 2TB of system memory, 2 Intel Xeon Platinum CPUs, and about 32TB of internal storage all on NVMe SSDs. It also has, connecting the cards, 4 NV switches, which is what NVLink uses to connect multiple cards together, and it also has external ports on the back that support 400Gbps over InfiniBand.
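Since nobody on the show wanted to do the math, here is a quick sketch that does it. The per-card figures are the rounded numbers quoted in the episode, not a spec sheet, and the constant names are just for illustration.

```python
# Per-card figures as quoted in the episode for the H100 SXM5 (rounded).
CUDA_CORES_PER_CARD = 16_000
TENSOR_CORES_PER_CARD = 528
HBM_GB_PER_CARD = 80
CARDS_PER_DGX = 8

dgx_cuda_cores = CUDA_CORES_PER_CARD * CARDS_PER_DGX
dgx_tensor_cores = TENSOR_CORES_PER_CARD * CARDS_PER_DGX
dgx_gpu_memory_gb = HBM_GB_PER_CARD * CARDS_PER_DGX

print(f"CUDA cores per server:   {dgx_cuda_cores:,}")
print(f"Tensor cores per server: {dgx_tensor_cores:,}")
print(f"GPU memory per server:   {dgx_gpu_memory_gb} GB")
```

That works out to roughly 128,000 CUDA cores and 640 GB of GPU memory in a single chassis, on top of the 2TB of ordinary system memory.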

So that’s what it has inside. It’s been abandoned a long time. Guess how big the server is. Uh,

Chris: well, I actually looked it up, so I know.

Ned: It’s an 8U server, Chris.

Chris: Does that mean you also know how much it costs?

Ned: 110 billion. The price tag for the server is in the hundreds of thousands, if not over a million.

Chris: Starting price is 300 grand.

Ned: Well, there we go.

Chris: Emphasize starting.

Ned: Yes, and if you put in that order... Stop interrupting me.

Chris: You stop.

Ned: Even if you ordered it at that list price, chances are you wouldn’t get the actual server for like 18 months. Because hyperscalers have this on lockdown. The other thing I want to mention is that this server consumes a maximum of 10.2 kilowatts. That will make more sense later when I talk about what an average server consumes, but just spoiler, it’s less than 10.2 kilowatts.

Now I know you’re thinking, Chris, you’re thinking, one server’s not enough. I need more. And NVIDIA says, you got it. They’ve put together what they call BasePODs for different industry verticals, like financial services. And the BasePODs are in collaboration with other hardware vendors, like your Dells and your Pure Storage, because what they need beyond the servers is some sort of networking and storage to tack onto them.

So the BasePODs are collections of racks of DGXs and the attendant storage. And if that’s not enough, you can scale up to the DGX SuperPOD, which, what the hell are we even doing here? Is this hardware porn? Yes. Yeah, it kinda is. But it’s also a real thing that somebody’s buying. So, the SuperPOD is based on scale units, and those scale units each have 32 DGX servers, InfiniBand switches providing 400 gigabit per second, single direction networking, and separate compute and storage fabric modules.

So that’s a lot. And it maxes out at 4 servers per rack. Because of power. Right. So this is a minimum. Minimum 8 rack purchase. Plus all this other stuff.

Chris: Incidentally, that is why you rent space to run your models, because you can’t afford this.
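The "minimum 8 rack purchase" falls straight out of the numbers mentioned for a scale unit. Here is a small sketch of that arithmetic, using only the figures quoted in the episode. It counts the DGX servers alone and ignores the switches and storage, so the real footprint and power draw would be higher.

```python
import math

# Figures quoted in the episode; treat them as round numbers, not a spec sheet.
SERVERS_PER_SCALE_UNIT = 32
MAX_SERVERS_PER_RACK = 4
MAX_WATTS_PER_SERVER = 10_200  # 10.2 kW, the stated maximum draw of one DGX H100

racks_needed = math.ceil(SERVERS_PER_SCALE_UNIT / MAX_SERVERS_PER_RACK)
watts_per_rack = MAX_SERVERS_PER_RACK * MAX_WATTS_PER_SERVER
watts_per_unit = SERVERS_PER_SCALE_UNIT * MAX_WATTS_PER_SERVER

print(f"Racks for one scale unit: {racks_needed}")
print(f"Server power per full rack: {watts_per_rack / 1000:.1f} kW")
print(f"Server power per scale unit: {watts_per_unit / 1000:.1f} kW")
```

Four servers already put a rack at about 40 kW, which is why power, not physical space, is what caps the rack at four.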

Ned: Yes, you can’t afford it. And even if you could, you probably couldn’t get it because all the cloud hyperscalers have bought it.

Chris: And if you got it, you couldn’t run it because you don’t own a nuclear power plant.

Ned: Side note, there’s a reason why Microsoft is looking into creating small form factor reactors right next to their data centers. And it’s this. Ha! So, that’s why if you actually want to take advantage of any of this hardware, you’re going to be renting it from someone else for some ridiculous amount of money per hour.

How do you actually harness all of this raw computing power? That is through the CUDA Programming Platform and SDK, of course. CUDA cores are the things on the NVIDIA GPU that actually do stuff. They make your ideas spring to life with uncanny valley verisimilitude. And yes, I actually typed that word correctly on the first go and then immediately misspelled the word correctly.

The irony was delicious. Now, if your ears hadn’t entirely glazed over in the previous section, you might remember that we have tensor cores and CUDA cores. What is the difference? I’m glad you asked, Chris.

Chris: Absolutely nothing.

Ned: No, there’s some differences. And uh, even more glad that someone else wrote a whole blog post about that, that I can crib from indiscriminately and include a link to in the show notes.

CUDA is an acronym, because of course it is. And it stands for Compute Unified Device Architecture, which is totally unhelpful and no better than calling it CUDA. So I’m going to keep calling it CUDA, because it’s fun to say. CUDA cores are like the original GPU cores, and they do math good fast. Math fast good?

Chris: Why didn’t you just say that?

Ned: Yeah, call them MGFs. Math good fast processors. Specifically, the kind of math that graphics processors do good is vector based floating point math. Something that CPUs are okay at, but really not great at. Which is why you needed a whole separate accelerator card to do it. It just so happened that data sciencing and all the other stuff we want to do for AI and other fields involves a lot of the same math that rendering vector based graphics does, so CUDA cores can be repurposed for those applications.

Tensor cores are the new kids on the block, and I was going to try to make some sort of reference to please don’t go girl, but I was tired. So we’re just going to move on. Tensor cores are more specialized than CUDA cores. They’re focused on doing matrix math on four by four matrices with floating point 16 or 32 bit precision.

Please don’t ask me to be more in depth with that because my brain kind of broke right around there. The point is, there tend to be far fewer tensor cores versus CUDA cores on a given GPU. For instance, the H100 that we talked about earlier has 16,000 CUDA cores and only 528 tensor cores. So something in the scheduler or the software needs to decide whether to send an operation to the tensor cores or do it on the CUDA cores.

So only if it’s really going to benefit from tensor cores should the jobs be sent there.
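For the curious, the operation described above, matrix math on four by four tiles with 16 bit inputs and wider accumulation, can be sketched in plain Python. This only illustrates the arithmetic; it says nothing about how tensor cores are actually programmed. The function names are made up, the half-precision rounding uses the standard library's `struct` module, and ordinary Python floats stand in for the wider accumulator.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def mma_4x4(a, b, c):
    """Compute D = A x B + C on 4x4 tiles.

    A and B are rounded to FP16 first; products and sums are kept at
    Python float precision, standing in for the wider accumulator.
    """
    a16 = [[to_fp16(v) for v in row] for row in a]
    b16 = [[to_fp16(v) for v in row] for row in b]
    return [
        [c[i][j] + sum(a16[i][k] * b16[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]

# Made-up inputs: multiply by the identity so the FP16 rounding is easy to see.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
a = [[0.1 * (i + j) for j in range(4)] for i in range(4)]
zeros = [[0.0] * 4 for _ in range(4)]

d = mma_4x4(a, identity, zeros)
print(to_fp16(0.1))  # 0.1 cannot be stored exactly in 16 bits
print(d[0][1])       # the same rounded value comes out of the multiply
```

The takeaway is the trade being made: inputs lose precision when squeezed into 16 bits, and in exchange the hardware can do a whole tile of multiply-and-add in one step.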

Chris: But it’s not much different than the difference between sending a job to a GPU or a CPU. One is more specialized, and with these tensor cores, it’s even more specialized there.

Ned: Yes, and from what I’ve read, there are even more specialized core types on the way for specific types of math, because of course there will be, as long as there’s a benefit to doing it.

Chris: Will they be chiplets?

Ned: They might be. I saw the initialism IPU somewhere. And when I was done laughing like a 5-year-old, I decided not to look any deeper into it. I’m sure it’s a thing.

Chris: I respect that.

Ned: Now, from what I can tell, there are two specialized libraries in the CUDA programming SDK for C++. cuBLAS is for basic linear algebra and general matrix multiplication, or GEMM. Dear God, we have to stop. And then cuDNN is for deep neural networks, which is what AI is using. AI is all about deep learning with neural networks, so it uses that cuDNN library. NVIDIA helpfully posted some example code using this library, which I found completely inscrutable, but hey, you know, if you’re a C person, give it a look.

Link in the show notes, like everything else. CUDA 12 is the current version, which actually supports the Hopper and Ada Lovelace architectures, which includes our H100 card. Interestingly, I looked up the support for that. And TensorFlow does not yet have a stable version that supports CUDA 12. It’s still on 11.

So if you happen to be using the H100, I guess reach out to NVIDIA or build it yourself. If you are actually writing a backend that supports tools like Keras and TensorFlow, you’re going to be using this developer kit. And you’re going to be writing stuff in C because once you get down to a certain hardware layer, it’s going to be C or something similar like Rust.

Because have some respect for yourself.

Dear God, yes. So that is the full stack. From the way you’d write your models, down to the software that supports it, to the scheduler that schedules the jobs, to the operating system and everything that runs it, to the hardware that’s actually going to support it. I do have a few other random things.

Do you want the other random things?

Chris: Only if you promise to go very quickly.

Ned: I will. I will do my absolute best. Connecting nodes. With the massive datasets that are flying around in an AI cluster, and the need for guarantees on latency and delivery, most AI hardware uses InfiniBand instead of Ethernet for transport.

InfiniBand might ring a bell if you’ve ever been within spitting distance of an HPC cluster. The InfiniBand standard is a different protocol than Ethernet. It does not use MAC addresses or TCP/IP to build the network. And without getting into the weeds, because Chris has told me I don’t have time for that, InfiniBand allows for extremely high bandwidth, something along the theoretical limit of 900 gigabits per second in the H100 series, and it has sub 100 nanosecond latency. Full CPU offload. Yeah, nanosecond. Not millisecond. When you pass data through a switch in InfiniBand, sub 100 nanosecond latency.

Chris: Yeah. Potentially literally talking about the speed of light.

Ned: Yes, basically. The way it does it, again, we’re not going to get into, go read a technical paper or something, but because of that, it also has end to end flow control for a completely lossless network, meaning there will be no retransmissions. I send the data and I guarantee you’ll get it. Kind of like Fibre Channel in that regard.
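To put the "speed of light" comment in perspective, here is a one-step calculation of how far light travels in the 100 nanoseconds quoted for an InfiniBand switch hop. The speed of light in a vacuum is a fixed constant; signals in copper or fiber are slower, so treat the result as an upper bound.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre
LATENCY_NS = 100  # the "sub 100 nanosecond" switch latency quoted in the episode

# Distance = speed x time, with nanoseconds converted to seconds.
distance_m = SPEED_OF_LIGHT_M_PER_S * LATENCY_NS / 1_000_000_000

print(f"Light covers about {distance_m:.1f} m in {LATENCY_NS} ns")
```

So in the time a switch takes to forward the data, light itself only gets about 30 meters, roughly the length of a row of racks. At these latencies cable length starts to matter.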

In fact, if you worked on Fibre Channel, a lot of the concepts map over. The downside is that InfiniBand is incredibly expensive due to the smaller total addressable market and licensing costs associated with the technology versus Ethernet and TCP/IP, both of which are open standards. Now, InfiniBand was never really meant to compete against Ethernet.

It was actually meant to be a replacement for PCI, because at the time, in the early 2000s, PCI was seen as a serious bottleneck inside systems, let alone sending data between hosts. Even the documentation from NVIDIA pitches the superiority of NVLink over PCIe 5. Which is, NVLink is their GPU interconnection solution that uses InfiniBand.

Recently, Ethernet has been pushing back. There are now 800 gigabit per second capable ports and switches in the Ethernet world. Yes, really. And the Ultra Ethernet Consortium is focused on developing an open standard that meets or exceeds what InfiniBand does today. It doesn’t help that InfiniBand is produced mostly by Mellanox.

And Mellanox just happens to be owned by NVIDIA. Other hardware vendors don’t love it. Key to both of those technologies is the idea of Remote Direct Memory Access, or RDMA, which avoids the need for a process or application to request memory access on a remote system from the CPU. Instead, accelerator cards like a GPU just reach directly across the network and access the memory on another host.

The concept itself is not new, but HPC and AI use it pretty heavily, and unfortunately, the implementations of RDMA over Ethernet, including RoCE, R O C E, I guess it’s pronounced Rocky, DeepSci, and iWARP, which I’m sure stands for something stupid and I refuse to look it up, none of those have been a runaway success, and they’re a key reason why InfiniBand is still preferred.

Two more things. Did I mention that a single DGX H100 server uses 10.2 kW of power in 8 rack units? Just for a little perspective, the average power usage for an entire 42U rack in 2020 was 8 to 10 kW. So a single DGX server uses as much power as an average rack.
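A quick note on units, with a sketch to go with it. Power draw is a rate and is measured in kilowatts; energy is that rate multiplied by time and is measured in kilowatt-hours. The calculation below turns the quoted 10.2 kW maximum into energy per day and per year, under the loudly stated assumption that the server runs flat out around the clock, which real workloads will not do.

```python
MAX_WATTS_PER_SERVER = 10_200         # 10.2 kW maximum draw quoted for one DGX H100
AVERAGE_RACK_WATTS = (8_000, 10_000)  # 8 to 10 kW per 42U rack, the 2020 figure quoted above
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 24 * 365

# Assumption for illustration: the server runs at its maximum draw around the clock.
kwh_per_day = MAX_WATTS_PER_SERVER * HOURS_PER_DAY / 1000
kwh_per_year = MAX_WATTS_PER_SERVER * HOURS_PER_YEAR / 1000

# One 8U server compared against the top of the range for a whole average rack.
exceeds_average_rack = MAX_WATTS_PER_SERVER > max(AVERAGE_RACK_WATTS)

print(f"Energy per day:  {kwh_per_day:,.1f} kWh")
print(f"Energy per year: {kwh_per_year:,.0f} kWh")
print(f"Draws more than an average full rack: {exceeds_average_rack}")
```

Under that assumption a single server uses about 245 kWh a day, and it does so from a fifth of the space of the rack it is out-drawing.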

Chris: Well, more.

Ned: More. So, not only does this have serious implications for data center design, especially in terms of power and cooling, it also has some not great consequences for the environment.
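To put rough numbers on the power comparison above, here is a small Python sketch. It reads the quoted figures as kilowatts of power draw, which is an assumption, and the "rack full of DGX servers" case is purely hypothetical: it ignores switches, cabling, and whether a facility could actually power and cool such a rack. The constant and function names are made up for illustration.

```python
# Figures quoted in the episode, read here as kilowatts (an assumption).
DGX_H100_KW = 10.2           # one DGX H100 server
DGX_H100_RACK_UNITS = 8      # height of that server in rack units
RACK_UNITS = 42              # height of a full rack
AVG_RACK_KW = (8.0, 10.0)    # average 42U rack in 2020, low and high end


def kw_per_rack_unit(kw, rack_units):
    """Power density: kilowatts drawn per rack unit of space."""
    return kw / rack_units


def servers_per_rack(server_units, rack_units=RACK_UNITS):
    """How many servers of a given height fit in one rack, by space alone."""
    return rack_units // server_units


dgx_density = kw_per_rack_unit(DGX_H100_KW, DGX_H100_RACK_UNITS)
avg_density_low, avg_density_high = (
    kw_per_rack_unit(kw, RACK_UNITS) for kw in AVG_RACK_KW
)

# Hypothetical: fill a rack with DGX servers, limited only by space.
dgx_count = servers_per_rack(DGX_H100_RACK_UNITS)
full_rack_kw = dgx_count * DGX_H100_KW

print(f"DGX H100 density:     {dgx_density:.3f} kW per rack unit")
print(f"Average rack density: {avg_density_low:.3f} to {avg_density_high:.3f} kW per rack unit")
print(f"DGX servers per rack: {dgx_count}")
print(f"Power for that rack:  {full_rack_kw:.1f} kW")
print(f"Versus a 10 kW rack:  {full_rack_kw / AVG_RACK_KW[1]:.1f}x")
```

Under those assumptions, one server already sits at or above the top of the average-rack range, and a rack filled by space alone would draw several times what an average rack did, which is why power and cooling drive the data center design conversation.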

I bet you have a full example.

Uh, remember when crypto was using as much power as the entire country of Argentina? AI looked at that and said, “Amateurs.” AI is already on pace to match what cryptocurrency was doing by 2027, and most experts think that the only limiting factor is how fast NVIDIA can produce chips and how fast hyperscalers can build data centers.

You better hope that AI is good at avoiding climate catastrophes, since it’s certainly going to cause many. Lastly, if any of this has piqued your interest and you don’t want to take out a second mortgage to pay for a DGX, and trust me, a second mortgage probably won’t be enough, the good news is that there are a ton of free resources out there that let you try the tech.

I’d recommend checking out Microsoft’s AI for Beginners to get down some of the terminology and also get some hands-on time. They’ve included a ton of content, including live labs, and you don’t need to pay for anything. You can also check out the sites I’ve linked earlier for Keras, TensorFlow, and Anaconda.

And you may also want to take a beginner course on Python, since it’s used so heavily. And that is it. That, sir, is how to AI.

Well, I don’t know about you, but I’m an expert.

In what?

Chris: I’m, I don’t know. I wasn’t really listening.

Ned: Excellent. Hey, thanks to the audience for listening or something. I guess you found it worthwhile enough if you made it all the way to the end.

So congratulations to you, friend. You accomplished something today. Now you can go sit on the couch, fire up PyTorch, and get some cute CUDA cores churning. Wow, that’s hard to say. You’ve earned it. You can find more about the show by visiting our LinkedIn page. Just search Chaos Lever or go to our website, ChaosLever.co, where you can find show notes, blog posts, and general tomfoolery. We’ll be back later this week to see what fresh hell is upon us. Ta ta for now.

Show Notes

The Anatomy of AI: The Tech Behind the Intelligence

Episode: 99 Published: 3/28/2024

AI’s Infrastructure: How It’s Built and Powered

Ned and Chris embark on a journey through the world of AI infrastructure, touching on key software and hardware components that make AI tick. From the basics of Python and TensorFlow to the power-hungry NVIDIA DGX servers, this episode covers everything you need to know about the backbone of AI. They also explore the rapid world of InfiniBand networking, highlighting its importance as well as the challenges it faces against Ethernet advancements. Finally, Ned and Chris reflect on the environmental impact of AI’s power consumption, humorously suggesting we might need to find a new planet or a better power source sooner rather than later.

This is a rebroadcast of this episode: https://chaoslever.com/cl-20231031/

Intro and outro music by James Bellavance copyright 2022

Hosts

Chris Hayner (He/Him)

Our story starts with a young Chris growing up in the agrarian community of Central New Jersey. Son of an eccentric sheep herder, Chris’ early life was that of toil and misery. When he wasn’t pressing cheese for his father’s failing upscale Fromage emporium, he languished on a meager diet of Dinty Moore and boiled socks. His teenage years introduced new wrinkles in an already beleaguered existence with the arrival of an Atari 2600. While at first it seemed a blessed distraction from milking ornery sheep, Chris fell victim to an obsession with achieving the perfect Pitfall game. Hours spent in the grips of Indiana Jones-esque adventure warped poor Chris’ mind and brought him to the maw of madness. It was at that moment he met our hero, Ned Bellavance, who shepherded him along a path of freedom out of his feverish, vine-filled hellscape. To this day Chris is haunted by visions of alligator jaws snapping shut, but with the help of Ned, he freed himself from the confines of Atari obsession to become a somewhat productive member of society. You can find Chris at coin operated laundromats, lecturing ironing boards for being itinerant. And as the cohost on the Chaos Lever podcast.

Ned Bellavance (He/Him)

Ned is an industry veteran with piercing blue eyes, an indomitable spirit, and the thick hair of someone half his age. He is the founder and sole employee of the ludicrously successful Ned in the Cloud LLC, which has rocked the tech world with its meteoric rise in power and prestige. You can find Ned and his company at the most lavish and exclusive tech events, or at least in theory you could, since you wouldn’t actually be allowed into such hallowed circles. When Ned isn’t sailing on his 500 ft. yacht with Sir Richard Branson or volunteering at a local youth steeplechase charity, you can find him doing charity work of another kind, cohosting the Chaos Lever podcast with Chris Hayner. Really, he’s doing Chris a huge favor by even showing up. You should feel grateful Chris. Oaths of fealty, acts of contrition, and tokens of appreciation may be sent via carrier pigeon to his palatial estate on the Isle of Man.