Transcript
Transcript prepared by Bob Therriault, Igor Kim and Sanjay Cherian
Show Notes
00:00:00 [Tali Beynon]
So with named axes, there's a sort of natural way to broadcast up. It's because you don't really have to make a choice. If an array has extra names that a particular function wasn't expecting, it's obvious what to do: you map over those names, right? So what was a single result becomes an array's worth of results.
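Tali's "map over the unexpected names" behavior can be sketched in a few lines of Python. This is a hypothetical mini-representation invented for illustration, not any real library: a named array maps a frozenset of (axis, index) pairs to a value, so axes have no order at all.

```python
# Hypothetical sketch of named-axis broadcasting (not a real library):
# a "named array" maps frozenset({(axis, index), ...}) -> value.

def named_array(data, axes):
    """Build a named array from nested lists plus a list of axis names."""
    out = {}
    def walk(node, prefix):
        if isinstance(node, list):
            axis = axes[len(prefix)]          # axis name for this depth
            for i, child in enumerate(node):
                walk(child, prefix + ((axis, i),))
        else:
            out[frozenset(prefix)] = node
    walk(data, ())
    return out

def total(arr, axis):
    """Sum over `axis`. Any extra named axes broadcast automatically:
    the result keeps one entry per combination of the other axes."""
    out = {}
    for key, value in arr.items():
        rest = frozenset(k for k in key if k[0] != axis)
        out[rest] = out.get(rest, 0) + value
    return out

# A plain "x" vector: summing gives a single result.
v = named_array([1, 2, 3], axes=["x"])
print(total(v, "x"))          # {frozenset(): 6}

# The same function applied to a "batch" of x-vectors: it maps over
# the unexpected "batch" axis, giving an array's worth of results.
m = named_array([[1, 2, 3], [10, 20, 30]], axes=["batch", "x"])
print(total(m, "x"))
```

The point of the sketch: `total` never mentions "batch", yet a batch of results falls out, which is the "natural way to broadcast up" Tali describes.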
[Music]
00:00:31 [Conor Hoekstra]
Welcome to episode 67 of ArrayCast. I'm your host, Conor. And today with us, we have the same guest as last time. So we will get to, I guess, reintroducing him or maybe not in a second. But first, we're going to go around and do short or brief introductions from our panelists. We'll start with Bob, then go to Marshall, and then to Rich.
00:00:48 [Bob Therriault]
I'm Bob Therriault. I'm a J enthusiast, and I'm looking forward to the discussion today.
00:00:53 [Marshall Lochbaum]
I'm Marshall Lochbaum. I've been through a lot of array languages. Started with J and worked for Dyalog for a while. And now I'm working on my own language, BQN.
00:01:02 [Rich Park]
I'm Rich Park. I'm an APL programmer and educator working at Dyalog Limited. I'm incredibly thrown off by the sudden decision to start announcing the podcast numbers at the beginning of the episode.
00:01:13 [CH]
Change is good, Rich. Change is good. And as mentioned before, my name is Conor. I'm your host, programming language polyglot, huge, massive array language fan. It feels like we're living in a golden age of array languages right now. But really, a lot of these array languages were created five or eight years ago, and I'm just learning them now. So it is, you know, part golden age, part we're just learning things from the past. And that brings us to our only announcement for today, which is from me: some of you may or may not have seen, and I apologize, I think I actually alluded to this maybe a month ago, or maybe even months ago, that I was going to do some Jelly live stream at some point. [01] And I gave like 24 hours notice when I decided I was going to do it. So no listeners to this podcast, unless they follow me on Twitter or YouTube, would have known, and I apologize for that. But anyways, there was an eight-hour stream on Saturday and then a five-hour stream on Sunday: 13 hours of content. I recommend you go watch every minute. And if you don't want to do that, I maybe will create like a "13 hours of streams in 10 minutes", because it was very fun. It was all about demystifying the Jelly programming language, a language by Dennis Mitchell, the individual behind Try It Online. And we love Jelly. You know, we've got to do some real reflection: is BQN, Kap, or Jelly our favorite language? We don't know. There's so much to be excited about, folks. I mean, Jelly is from what, 2015, '16. BQN is from 2020. Uiua from now. And Kap is also from 2020. Anyways, that's all I'll say; links in the description, folks. And I once again apologize for not giving more heads up. I got excited, and that's what happened. Anyways, I'll leave it there.
I mean, Adám is the one that did the Jelly, Jelly beans, we'll call it, website. So we will hopefully get Dennis on in the future, and Adám will be there, and we'll chat about it. Wrapping that up, moving on to part two of our two-part conversation with Tali Beynon. If you are a regular listener, you will know Tali from our last episode, 66, where we talked about a plethora of stuff, a lot of stuff I didn't even know we were going to chat about. It was fantastic. I'm not going to recap the whole episode, [02] you should just go listen to it. But it was fantastic hearing about your intellectual journey over the last decade plus; it sounds like working for Stephen Wolfram at one point, no longer working for Stephen Wolfram, having worked on a ton of interesting stuff with MATLAB. Like I said, if you haven't seen the episode, or heard the episode, more accurately, go and check that out, because we are basically going to be picking up right where we left off.
00:03:51 [RP]
Mathematica -- doesn't matter.
00:03:54 [CH]
Mathematica -- doesn't matter. It definitely does. But yes, I'll throw it over to Tali. I mean, we ended last episode, 66, by sort of previewing this rainbow array programming model, which has some overlap with ideas that Marshall has with his sort of iridescence language. So I'll throw it over to you. I'm not sure if you want to expound a little bit on what you said at the tail end of the episode, because we only had a few minutes and we were way over time. And then we'll go from there.
00:04:26 [TB]
Thanks, Conor. Yeah, so maybe I can set the scene a little bit: what is the problem, I guess. And the context is, as everyone who has used an array language, or even just manipulated arrays in traditional languages, knows, arrays have this concept of axes baked into them. In something like Python with NumPy, or deep learning frameworks like TensorFlow, these axes live in a canonical order. So vectors have one axis, and order isn't really relevant. Matrices have two axes, and we call them rows and columns, and think of rows as being the first axis, for example, often, not always, and columns as the second axis. And then it gets more exotic as you add axes, and harder to think about and talk about. Certainly you can't visualize arrays with four or five or six axes in any sort of intuitive way. And that's the dominant paradigm that we've lived with since forever. But I think there's a growing recognition by some practitioners in different areas that it might carry certain limitations to stick with that paradigm. And so there's a broader movement, I guess you could say, or philosophy, that instead of numbering our axes, we should name them. And that gives us an opportunity to, first of all, document what the axes are supposed to mean in any given context where you construct an array, transform an array into a new one, or combine several arrays. Instead of saying: I'm going to take the tensor contraction of the third axis of this array with the second axis of that array. Which might be very meaningful. It might correspond to doing some kind of color blending operation on an image. You might be, I don't know, tinting an image or turning it to grayscale, or you might be doing something in deep learning, like a convolution on an image with a set of convolution kernels.
But that's not documented in the description of the operation when you're just using these numbered axes. I mean, sometimes you'll have a function that does that for you, and the name of the function will tell you what's going on. But then you've just pushed the problem up one stage: you've got to make sure that you're feeding in data in the right form, that the way the axes are ordered in the input data matches what that function expects. So I think this has become most acute in deep learning. With today's modern neural network architectures, you have quite a lot of axes floating around, and often not in a standard order. There could be many reasons for this. It could be that at different points of the network, where arrays are being expanded and collapsed and changing shape all the time, it's hard to maintain any kind of convention. It might not be possible to maintain a convention. It might also be that the little functional gadgets you're using, the optimized GPU kernels that are making everything fast, making it possible to train neural networks, expect particular semantic axes: feature dimension, X and Y image position, pixels in an image, color channel, or whatever. So call those semantic dimensions, right? We know what they are. We don't really care where they are in memory, but we know what they are. The library that you're using to make all of this fast might expect those in particular positions. And so you've got to make sure that you get them into those positions in order to call the function and have it do the right thing. And furthermore, it might be that if you can massage your tensors, your arrays, into the right shape, you can get massive performance benefits. So it might be that there's a natural way to set things up, but that's not actually the fast way.
And so you're forced into doing a lot of juggling of axes. Maybe even something that conceptually wants to be a matrix, because there are two independent axes that are semantically distinguished, should actually be flattened into a vector to make some kind of matrix multiplication you want to apply really faster than doing it the quote-unquote correct way. So all of these things are sort of fighting you. And one of the reasons we experience this friction is that we've got one foot in the world of human-interpretable things: we look at an array with a certain number of axes and a certain shape, and in our heads we know what it's supposed to mean. But then there are the demands of the machine, or the GPU, so on the other foot we're trying to keep things close to what the metal wants. And I think that is a painful position to be in, because you can't really win at both of those tasks, making things intelligible to humans and making them fast to execute, simultaneously. [03] They're orthogonal. And the last thing that you want to do is take an array program that you know is correct, that's doing something quite sophisticated, and know that if you just re-architected it, you could make it five times faster, but it would become undocumentable. It would become so complex that no one could ever maintain it in future. So I think that tension is something that needs to be solved with tooling. And certainly everyone sort of knows this, and there's progress being made in that direction. Various languages and frameworks are trying to tackle that problem head on. And I suppose I'm interested in the more conceptual level. I'm not designing a language. I'm, mostly for myself, designing a way to think about what a better way of doing this would look like.
What it would look like to attach as much semantic information to axes as possible and how it would change the way that we would express an array program. So that's the problem statement. I don't know if that's clear enough or if that leaves open questions or people want to debate anything I've said there.
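Tali's "semantic dimensions, arbitrary positions" point can be made concrete with a toy grayscale conversion. This is a hedged sketch invented for illustration (the helper name and the exact-binary demo weights are made up; the default weights are the standard BT.601 luma coefficients): the code finds the channel axis by name, so channel-first and channel-last data are handled by the same call.

```python
# Hypothetical sketch: contract the axis *named* "channel", wherever
# it happens to sit, instead of hard-coding an axis position.

def to_gray(image, axes, weights=(0.299, 0.587, 0.114)):
    """Weighted sum over the 'channel' axis of a nested-list image
    with axes named by `axes` (some permutation of y, x, channel)."""
    # Recover each axis's extent from the nested-list shape.
    shape, node = [], image
    while isinstance(node, list):
        shape.append(len(node))
        node = node[0]
    size = dict(zip(axes, shape))
    pos = {a: i for i, a in enumerate(axes)}   # name -> position

    def pixel(iy, ix, ic):
        idx = [0, 0, 0]
        idx[pos["y"]], idx[pos["x"]], idx[pos["channel"]] = iy, ix, ic
        v = image
        for i in idx:
            v = v[i]
        return v

    return [[sum(w * pixel(iy, ix, ic) for ic, w in enumerate(weights))
             for ix in range(size["x"])]
            for iy in range(size["y"])]

# Same pixels, two layouts, one call site: only the names move.
print(to_gray([[[4, 2, 4]]], ["y", "x", "channel"], (0.25, 0.5, 0.25)))
print(to_gray([[[4]], [[2]], [[4]]], ["channel", "y", "x"], (0.25, 0.5, 0.25)))
```

With positional axes, the caller would have had to transpose one of the two layouts first; with names, the layout commitment simply never enters the operation's description.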
00:10:35 [BT]
Not to debate, because I'm not even close to being able to debate this. But it's set up to make it more human-understandable, this kind of naming of axes. And it seems to me that in doing that, you're transferring some of the work over to the computer, because a lot of the strides and things like that that we use in arrays are no longer available if we can swap the axes around automatically. There's got to be a load that you're carrying with the axes to be able to keep things translatable. First thing is: is it worth the payoff to do that? And the second thing is: is it something that could be done temporarily, so we understand what's going on, and then removed, so we could take it back to a more machine-friendly version?
00:11:17 [TB]
Well, I think that's a great point. You can see it as a sort of spectrum. When you're debugging a toy program, or rather when you architect something that you want to trace, like you want to see exactly what's happening to numbers on a screen, maybe with some sort of fancy visualization that you want to engage, then it might be really helpful to know exactly how things are laid out in memory, and to follow step by step what's happening and ensure that things are roughly what you expected. But that doesn't mean that that has to be the final executable form of some complicated array program when it's running in production on vast datasets. Especially when you think that it might be quite difficult to explore all of the different choices you have about how to order axes, or how to compile things down to individual kernels that will run on a GPU or CPU. That sort of optimization process is very expensive because it's so open-ended. The longer you expect to be running an array program, the more effort it makes sense to invest in that. Automated effort, I mean. So I feel like, just as we have optimization flags for traditional compilers, where -O3 might be really good for performance but you'll wait maybe twice as long for compilation and it'll make debugging much harder, the same set of knobs and philosophies should apply to array programs. If you know you're going to be doing something really hectic on a lot of data, then let the computer super-optimize. Maybe let it even do profile-guided optimization, where it gets to run on real data. The shapes of the arrays are then known at runtime, and it can try to decide whether, due to cache locality or whatever it is, there's some particular way that your very high-level conceptual idea of how you want to manipulate these arrays is best implemented in hardware.
And at the extreme end of that, maybe one day we'll have FPGAs that are actually in all of our hands and all of our devices, where you're literally setting up circuits on a chip to do a particular array program that turns out to be very important, or that you want to run often. So there needs to be a spectrum. We certainly want to have tools, or languages, or frameworks, that float above and don't target any particular place on that spectrum, but let you express in a maximally flexible way what you're doing. And then, as time goes on, you can tune it to exactly where you want to be on the performance/debuggability side of things.
00:14:02 [ML]
And to maybe make this explicit, one thing that a more abstract source code gives you is that instead of saying, "All right, the human is going to optimize this, pick out their axes, and hopefully they're an expert who can do this and pick the format that's best for where they're executing." If the human does this, they're specializing for one architecture probably. So if you've got one bit of source code that maybe even for small sizes should run on the CPU because the overhead of moving data to the GPU is too high and for larger sizes should run on the GPU, an abstract format is the only way you're going to be able to specify that reasonably because the choices that you would make on a CPU versus a GPU are often going to be pretty different. So having the flexibility to say, "The computer is going to decide how to optimize things." And maybe it doesn't even do a good job, but it can do this with the knowledge of where it's going to run because you're not going to have a human re-optimize each time you're running your program.
00:14:59 [TB]
No, totally. I'll add another example of the power of a very high-level format for array programs, I should say, not languages: array programs. If you have this very abstract representation of what an array program is doing, you might be able to know in advance what the time-versus-memory trade-offs are of various execution plans for it. Absent any particular architecture in mind, you just know: well, if I choose to evaluate these chunks of array operations in the following order, what am I forced to keep in memory at one time, versus what can I recompute? This kind of thing. It's not a novel idea. People do this already with deep learning, but not at the most abstract level I can imagine. And I suppose with things like MLIR, [04] which is one of the technologies that I think Google works on with others, that would be an example of how these sorts of ideas are out there, but maybe specialized to the deep learning side of things. And yeah, if we fully zoomed out, we'd think of array programs as a very generic phenomenon that doesn't have to be about deep learning, doesn't have to involve backpropagation, doesn't even really need to be about numbers.
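Tali's "rank execution plans from shapes alone, before touching any data" idea can be illustrated with the classic matrix-chain example. The sizes below are invented for illustration; the cost model is the standard one for naive matrix multiplication.

```python
# Hedged sketch: two plans for A(10x500) @ B(500x500) @ C(500x1).
# Naive matmul of an (a x b) by a (b x c) costs a*b*c multiply-adds,
# so a planner can compare plans purely from shapes, no data needed.

def chain_cost(m, k, n, p, left_first):
    """FLOPs for (A@B)@C if left_first, else A@(B@C),
    where A is m x k, B is k x n, C is n x p."""
    if left_first:
        return m * k * n + m * n * p   # builds a big (m x n) intermediate
    return k * n * p + m * k * p       # builds a small (k x p) intermediate

m, k, n, p = 10, 500, 500, 1
print(chain_cost(m, k, n, p, True))    # 2505000
print(chain_cost(m, k, n, p, False))   # 255000: roughly 10x cheaper
```

Both plans compute the same result; only the evaluation order, and hence the intermediate that must be held in memory, differs. That is exactly the kind of decision an abstract representation lets a compiler make.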
00:16:14 [ML]
Well, they can be JSON parsing. That's a great example for array programming, and it's been implemented, I believe, in C++ as an array program, because the fastest way to parse JSON is using array primitives; but there's no array language that can handle this in the fastest way.
00:16:33 [TB]
That's super interesting.
00:16:34 [CH]
So just to make sure I'm on the same page as everyone, or at least most of the folks in this conversation: I totally agree with the dichotomy, or the friction, between writing the most expressive version of a program possible and having to balance that against whatever perf you're trying to get. That's just a problem that exists in every language. Python is probably the greatest example of a language that is very expressive. It's probably the closest thing we have to a pseudocode language, but performance-wise, the number that people throw out is that it's 10x slower than other languages. And there are certain libraries, like NumPy and Pandas, that are basically implemented in C, and that's what it boils down to at the end of the day. But still, compared to GPU-accelerated stuff, it gets blown out of the water, which is why you see projects like Mojo, which is being built on top of MLIR, that are basically trying to take Python and add a bunch of this stuff that's less expressive, adding types to a program. You could argue that that makes it more expressive, but it's more boilerplate, more ceremony, if you will. And if you've taken a look at the 20-minute intro Mojo talk, progressively, as you unlock more and more performance, you're adding more and more information; I think they're adding a struct type and whatnot. So I totally agree with that. The piece that I'm not totally clear on is: as part of this, getting to the rainbow array programming and named axes, are you arguing that this solves, or improves on, both sides of it? That it enhances readability and expressivity, especially when you're in a specific domain, and there's this state in your head, whether it's the RGB example, or you're in finance and you're dealing with OHLC stock price info over different tickers, over different times, over different markets; all of those axes could be named.
But in certain cases, you might just be dealing with some 4D or 5D thing. And so it increases expressivity. But on top of that, because we start doing that, somehow you unlock a way to also get perf benefits. And that was the tooling you were mentioning? So I can definitely see how naming axes potentially increases the understandability and readability of certain programs. It's unclear, though, if you're also arguing that doing this opens a path towards: we have a high-level abstract language, and therefore, if we build the right compilation techniques, we can target different hardware and get maximally efficient code along with maximally expressive code. That is the ultimate dream of NVIDIA, of course. We want people to be able to easily write their code and then have it run super fast. But right now there's always a trade-off. You can write in Python and get some amount of acceleration, but you're never going to beat some hand-tuned CUDA ninja who is writing kernels from scratch and knows exactly what hardware they're targeting, et cetera, et cetera. But having both of those at the end of the day would be amazing. Anyways, I'm not sure if my question was clear in that two or three or five minute ramble there.
00:19:54 [TB]
No, I think that was perfectly clear, and it's a great point. And I think I am claiming that, yeah, those are two orthogonal benefits of named axes. And I'm not saying that that's the only part of the story, that named axes are going to solve everything. But I think that, yeah, they do offer massive increases in readability, documentability, and robustness, especially when coupled with shape checking: making sure that the shapes of all of your arrays make sense. Not only that you haven't made shape mismatch errors, where you're trying to combine two dimensions that don't have the same size, something that you can verify at runtime, or at compile time if you have a small array language, but also that you didn't accidentally combine two things that happen to be the same size but are semantically different. That can quite easily happen. You can have square matrices where the rows and columns still mean different things. And if you transpose such a matrix and use it in the wrong form, you use it transposed, ordinary dimension checking won't surface that error. So there's a story about robustness, documentability, human readability. But separately, as you point out, there's a story about optimization. And it's that as soon as you decouple these named axes from particular commitments to memory layout, because that's ultimately where the rubber meets the road for the whole axis-order thing: how is a shape that is, whatever, four-dimensional going to be laid out in one-dimensional memory, with strides [05] and so on? That's where it actually has a performance impact in the end. And by removing that commitment, you free up compilers to just decide: how do I think the memory access patterns of this particular array program are going to prefer these axes to be ordered? Especially if I want to parallelize over a whole bunch of tiles on a GPU, it might be clearly better to do it one way versus another.
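Tali's "same size, different meaning" failure mode can be sketched concretely. This is a hypothetical helper, not any real library's API: shapes are dicts from axis name to extent, so combining arrays checks names rather than positions, and the transposed-square-matrix bug class disappears.

```python
# Hedged sketch of named shape checking (hypothetical helper):
# a shape is a dict of axis name -> extent; order does not exist.

def check_combine(shape_a, shape_b):
    """Shared axis names must agree in extent; result carries the union."""
    for name in shape_a.keys() & shape_b.keys():
        if shape_a[name] != shape_b[name]:
            raise ValueError(f"size mismatch on axis {name!r}")
    return {**shape_a, **shape_b}

ratings = {"user": 100, "movie": 100}   # square, but axes mean different things
weights = {"user": 100, "feature": 8}
print(check_combine(ratings, weights))  # ok: shared 'user' axis agrees

# With names there is no silent "used it transposed" bug:
# {'movie': 100, 'user': 100} is literally the same shape, and a
# genuine mismatch is caught even though a purely positional check
# of a 100 x 100 array against a 100-sized axis could pass by luck.
try:
    check_combine(ratings, {"user": 100, "movie": 50})
except ValueError as e:
    print(e)
```

Positional dimension checking compares numbers that happen to coincide for square matrices; checking by name compares meanings, which is the robustness story Tali describes.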
So those are things that a compiler can tackle, and it can have good heuristics. This is very similar to how ordinary scalar code is compiled: we have plenty of heuristics on when to inline things and when to unroll loops, that kind of thing. But then there's a different point, and it's the observation I maybe made in the earlier episode: it feels so similar to the situation we were in before the advent of structured programming. Which is such a crazy long time ago, so long before I was born, that I had to read about the bad old days when we didn't have named variables. Forget about named axes; you would write assembly code, but you didn't have functions. It was all just one big blob of state. Then people argued that this was a bad thing, and that we could structure things, we could abstract out patterns and deal with logical concepts like variables as logical names for locations in memory, where we don't particularly care where they actually are. We'll leave that to the compiler. Or, for local variables, things like registers: you don't want to have to be deciding that your foo and your bar should go in register one and register two on your CPU. That's for the compiler to decide. And it may be that they can't fit in registers, that they've got to be on the stack or in RAM. That's just not relevant for 99% of the programming that we do today. And it's better that it's not, because those decisions are better left to compilers. In theory, you could hand-optimize something to be a little bit better than what a modern optimizing compiler produces, but it would take you so long to write substantial amounts of code that way, and it would be so error-prone, that you still get a net performance benefit by taking the big chunk of code that you have to write and allowing an optimizing compiler to do the best it can with it, which you wouldn't be able to do by hand.
So I think the same argument applies to array programming, I hope.
00:23:57 [ML]
The point that I would make about that is, and you can even draw the analogy to assembly as well: I don't think that being able to make these optimizations requires you to drop the axis ordering in the source code. So you can have the same programs that we write now. And in these programs, even in something dynamic like NumPy, pretty early on the rank of all the arrays is going to be known. And at that point you can go through and figure out where all the axes go. So a compiler could start doing this without having the names, just by dropping the requirement that things are laid out in the particular way that the axis ordering suggests. And that's the same thing as assembly registers. When you have a C program, basically the first thing the compiler does is drop the names and just number everything, or make them references internally or whatever. And as part of that, it's going to impose some temporary order, but it doesn't care about that at all. So either way, whether you had names or you had things ordered, if you have a semantic model that says the compiler doesn't care about the ordering, you can do the same optimizations, I think.
00:25:10 [RP]
Well, that was the discussion last time about what are keys, and whether arbitrary keys could happen to be ordered, but that's for the writer to care about, like you're saying. There are some ordered integers for the... Remember last time there was like an outer product of whether your keys are ordered or not. Yeah.
00:25:32 [ML]
Well, what I'm saying is that for optimization purposes, it doesn't matter, because if the programmer gives you names, obviously you can't do anything with the names. You're just going to ignore them; you're going to collapse them into some references or whatever that you can compare and nothing else. And if the programmer gives you an ordering, you can make use of that, or you can just ignore it and say: I have a smart compiler, I'm smarter than the programmer, I'm going to rearrange. And hopefully you are smarter than the programmer. Maybe not.
00:26:04 [TB]
No, I think that's true. And there's even a fun analogy with what CPUs do internally. Even though the compiled program has specified, oh, this goes in register one, this goes in register two, CPUs think that they're smarter than the compiler. And they are. They will recognize that there are actually way more registers available: modern CPUs have these register files that are way bigger than the architectural number of registers.
00:26:31 [ML]
Yeah, like hundreds of registers.
00:26:33 [TB]
So what it'll do is notice that there are all kinds of optimization opportunities because of the lack of dependencies among these instructions. You might be doing two separate things that are semi-parallel, like you're adding two numbers here and multiplying two numbers there, and they don't really depend on each other, but the instruction stream is serial. And you're waiting on RAM to reveal what one of those numbers was, and you're sort of twiddling your thumbs. So the CPU is going to actually do the other thing at the same time. But it might be that there's some sort of partial dependence, like it's actually using the same register, but it was going to be rewritten later anyway.
00:27:06 [ML]
Yeah. Well, this can even happen in a for loop. Obviously you don't want to write two copies of the inside of the for loop; well, I mean, compilers do this a lot, but sometimes they might not want to write two copies of the inside of the for loop just to use different registers for the second copy. So you'll have a loop where it's using the same variable, but the variable only exists for the span of one iteration. And on the next iteration, there's something in the same register, because it uses the same instructions, but it doesn't need to be.
00:27:35 [TB]
It's like the CPU is maintaining an illusion for the benefit of compiler writers, ultimately. They want to see what looks like a completely serial thing with these fixed registers. And what the CPU is actually doing is just thinking about the causal dependencies. The names are irrelevant; the register numbers or whatever are irrelevant. It's the causal dependencies among these bits of state that are relevant. And it can extract parallelism by understanding that web of dependencies and doing a topological sort on them, or whatever it is that CPU designers actually do. So I hear what you're saying. Your point is you could say: okay, in a traditional axis-ordered language, I've got a matrix. It's whatever, column major or row major, I don't even remember which one is which. But the compiler realized that that was the wrong order for this to be fast, so it's going to do the other thing. It's going to maintain the transposed version, backed by memory that I don't get to see, and my program's faster. I didn't realize it was kind of lying to me, and if I try to observe it, it'll look right, but it's pulling all these tricks to make it work. I agree, that's certainly possible, and people are grandfathering that kind of approach into traditional paradigms. I do think it's simpler, though, if there doesn't have to be this weird illusion going on. There was just never a guarantee. There's never even a concept of what the axis order is, because the only reason it's ever relevant is when you want to look something up in the array, and then you've got to choose: how are you going to specify the indices to this array to get the element you want?
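For the record on "column major, row major, I don't even remember which one is which": the two layouts store the same logical array and differ only in their strides, which is exactly the commitment a named-axis model drops. A small sketch (strides counted in elements, not bytes; the helper names are made up):

```python
# Hedged sketch: strides for the two classic layouts of one array.

def strides(shape, order):
    """Element strides for row-major ('row') or column-major ('col')."""
    n = len(shape)
    s = [0] * n
    acc = 1
    # Row-major: last axis varies fastest; column-major: first axis does.
    axes = range(n - 1, -1, -1) if order == "row" else range(n)
    for i in axes:
        s[i] = acc
        acc *= shape[i]
    return s

def flat_index(idx, strd):
    """Position of a multi-index in the 1-D backing memory."""
    return sum(i * s for i, s in zip(idx, strd))

print(strides((2, 3), "row"))   # [3, 1]: rows are contiguous
print(strides((2, 3), "col"))   # [1, 2]: columns are contiguous
print(flat_index((0, 2), [3, 1]))   # 2
print(flat_index((0, 2), [1, 2]))   # 4: same element, different home
```

The "illusion" Tali describes is a compiler silently switching between these two stride vectors; in a model with no axis order, there is simply no layout promise to keep up.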
00:29:17 [ML]
Yeah. Well, it's easier for the programmer to understand the performance model, I might say.
00:29:21 [BT]
On the part of the programmer, bringing it back out to the programmer: what's required of the programmer? How do I decide what... Is it like I'm going to have to type my axes? Is that how it works?
00:29:35 [TB]
Well, so what does it mean to actually use named axes? I suppose it means that every time you make a vector or matrix or three-array, four-array, five-array, whatever, you've got to name those axes. And you're not just naming them as busy work. It's not just a form of documentation, though that might still be very useful. And again, I refer back to my experience from deep learning: complicated architectures are just strewn with documentation reminding the reader of what shape a particular tensor is supposed to be. And it's not filled in with numbers, of course, because those are actually mostly irrelevant. It's filled in with names. So it's there anyway, right? But it's not just busy work or a form of documentation. It's also that when it comes time to combine arrays, in practice you just find that, when doing element-wise multiplication, for example... For matrices it's called, whatever, the Hadamard product. [06] It's just element by element. It's not matrix multiplication, the linear algebra stuff. It's just element-wise. For ordered axes, it sort of doesn't make sense to do that with a matrix and a vector. What would that mean? Which way do you want to broadcast it? And you can invent rules for how you will consistently broadcast. But in practice, certainly for me, and maybe I'm just not experienced enough as an array programmer, I've always struggled to build a little mental model of what will happen in a particular situation and then decode: how do I force it? How do I set it up so that the convention will do what I actually want? And it's like you're playing this silly game that you shouldn't really need to play.
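The matrix-times-vector ambiguity Tali describes can be made concrete. With positional broadcasting you must remember a convention (which end of the shape does the vector align with?), while a name on the vector's single axis pins the alignment down. A hedged sketch with hypothetical helpers:

```python
# Hypothetical sketch: elementwise matrix * vector, disambiguated by
# the name of the vector's one axis rather than by a broadcast rule.

def mul_named(matrix, m_axes, vector, v_axis):
    """Elementwise product; the vector aligns to whichever matrix
    axis shares its name, regardless of that axis's position."""
    rows, cols = len(matrix), len(matrix[0])
    if v_axis == m_axes[0]:
        return [[matrix[i][j] * vector[i] for j in range(cols)]
                for i in range(rows)]
    if v_axis == m_axes[1]:
        return [[matrix[i][j] * vector[j] for j in range(cols)]
                for i in range(rows)]
    raise ValueError(f"axis {v_axis!r} not found in {m_axes}")

m = [[1, 2, 3],
     [4, 5, 6]]                                    # axes ("row", "col")
print(mul_named(m, ("row", "col"), [10, 100], "row"))
# [[10, 20, 30], [400, 500, 600]]: scales each row
print(mul_named(m, ("row", "col"), [1, 0, 1], "col"))
# [[1, 0, 3], [4, 0, 6]]: scales each column
```

Both calls are elementwise products; no leading-axis or trailing-axis convention has to be memorized, because the name answers "which way do you want to broadcast it?" directly.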
00:31:21 [ML]
I definitely do get that feeling. I mean, even after nearly 15 years of array programming. So it doesn't go away.
00:31:31 [RP]
Yeah. Not to clickbait this too much: is this "leading axis is dead"?
00:31:36 [ML]
Well, I mean, the leading axis is a very nice model for leaving out that information that you don't necessarily need. But as you get more axes, that model really falls apart.
00:31:51 [RP]
Yeah. I've seen this in even the basic APL tutorials. As soon as they start introducing the concept of multi-dimensional arrays, and largely these are aimed at people using it for business applications, they're immediately assigning names, tokens for each of the numbers that index the corresponding axes. So instead of using the numbers to index each of the dimensions, we're now using the concepts, essentially naming them in a roundabout way. So yeah, it's pretty common.
00:32:29 [CH]
I mean, it reminds me too, I can't remember if it was the tacit episode number two or three or something, but at one point, Adám and Marshall, we had gotten to the point of talking about dyadic transpose, which is still a glyph where, I mean, I know transpose, but I've never used it dyadically. And, you know, I'm lost at that moment. But then Adám and Marshall get into this back and forth, because the behaviors in BQN and Dyalog APL are slightly different. And I understand, this is Rich coming from an array podcast where all of our languages are symbols, so we probably all sound like we're speaking Greek to our listeners half the time. But amongst the few people that were on that recording, Adám and Marshall start going back and forth and they're like, oh yeah, that's just a one, two, three, four, you know, one rotate. And for like two or three minutes, I remember, Bob too, we were both just listening to this being like, what is happening here? But clearly you two are both so versed after years of this. Also, I know, Bob, you have a lot of experience, but I have never used dyadic transpose, and listening to Adám and Marshall go back and forth, I was just like, holy smokes, I don't think I'm ever going to get to that point. And you guys were just so much on the same wavelength, finishing each other's thoughts: oh yeah, yeah, that's a good point, and you do the two rotate, blah, blah, blah. Clearly you can get there if you spend enough years and time; both of you are implementers, you with BQN and Adám with Extended Dyalog APL. But is that the best model when you're getting to, you know, rank-6, arbitrarily ranked arrays? Yeah, maybe not.
00:34:07 [ML]
Well, so I will say that largely the point that I was making in that exchange was that in practice, the vast majority of dyadic transposes, and I mean, this is the operation that is able to reorder the axes in any arbitrary order. The vast majority of these are used just for simple things like swapping two axes or putting one axis somewhere else in the array. I mean, so there are various conclusions you can draw from this. One of them is that APL is really not used for all this, you know, serious axis heavy machine learning stuff. And another is that this axis heavy stuff is actually a fairly niche application. But I think it's also very interesting to look at, you know, what happens, like, are we by choosing APL, are we constricting the way we write programs to follow this model of, well, I can only move one axis at a time, because, you know, having a data layout where I move more than that is just so complicated? Or is it that, you know, this is really something that's rarely done and applies to a few fields? I don't really know.
00:35:14 [BT]
And part of the confusion as well is the Tower of Babel thing, because J does the dyadic transpose in a very different way than APL does. In fact, I think they're inverse of each other.
00:35:26 [ML]
Yeah, well, I mean, part of the reason that I did that is because in BQN, you can very easily write transpose inverse, and then get the J model. If you do it the other way around, the inverse doesn't give you as much; you can't express some things, like taking diagonals or combining axes.
00:35:44 [TB]
I wanted to pick up on the dyadic transpose. I assume that's transpose with an argument, which says I want to send these axes into these other positions.
00:35:53 [ML]
Yeah, it's just a list of numbers.
00:35:54 [TB]
Parameterized transpose, yeah. So Mathematica has the same thing. I think a lot of frameworks have that idea, as opposed to just swapping the first two things.
00:36:05 [ML]
Well, yeah, I mean, there are various ways. APL's monadic transpose actually reverses the order of all the axes, which I don't like. And in BQN, I just put the first axis to the end.
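For reference, the transpose behaviors mentioned in this exchange can be sketched in NumPy. This is an illustration using NumPy's own functions, not the APL, BQN, or J implementations themselves:

```python
import numpy as np

a = np.zeros((2, 3, 4))

# Reversing the order of all axes: what APL's monadic transpose does
# (this also happens to be NumPy's default transpose).
assert np.transpose(a).shape == (4, 3, 2)

# Moving the first axis to the end: the BQN convention Marshall describes.
assert np.moveaxis(a, 0, -1).shape == (3, 4, 2)

# "Dyadic" transpose: an explicit permutation. NumPy's axes argument says,
# for each result axis, which source axis it comes from; the APL and J
# conventions are reportedly the two inverse readings of such a permutation.
assert np.transpose(a, (2, 0, 1)).shape == (4, 2, 3)
```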
00:36:16 [TB]
To the end, okay, that does seem very sensible, yeah. So can I comment a little bit, Marshall, about that axis-heavy point? So I think probably yes. It could be that in deep learning, where you've got a lot of axes, and I'll just give examples for people who are not into deep learning. You'll have things like a batch dimension, [07] which is just, when you're training, you train on maybe 64 or 128 examples at once, and you're doing the same thing; you're just mapping over the batch dimension 99% of the time. You might also have sequence position. So if you're doing a transformer, it's operating on a whole sequence of tokens. Those are indexed by some dimension, some axis. Then you'll have feature axes. Those are like the units of a neural network; they're kind of neurons, so to speak. Usually the more, the better. Obviously, you choose wisely where you're going to spend your budget, but you'll have, whatever, a thousand components that represent the activities of a thousand different neurons in a single layer of a multi-layer perceptron. And then, especially attention in a transformer, that's a very good example of where it gets really complicated, because all of these things are simultaneously floating around. You've got batches, you've got sequences, you've got feature dimensions, and then you've got this weird phenomenon where, in a transformer, if people aren't familiar with this, these tokens, which you can think of as words, but they're really little components of words, little chunks of letters, they have an opportunity to look at each other. And that's what a transformer does: it gives an opportunity for all of these tokens to send information to each other and to figure out what they need to pay attention to from all their surrounding tokens in the sentence. And step by step, these tokens gradually transform themselves, update their states and "figure out" what the sentence means.
Obviously it's very complicated. We don't really know how it does that, but that's the basic idea. So the fact that there's pairwise interactions between tokens, tokens are looking at other tokens, means that the sequence axis will come up twice. So now you've got yet another axis there. So the point is taken that deep learning might be particularly axis heavy, but I do want to make a claim, which is that even before I got involved in machine learning and deep learning, a phenomenon I picked up just in ordinary programming in Mathematica, which is quite array heavy, quite list heavy (it's quite Lisp-like), was the need to take something that's working but isn't parameterized by a thing yet. You've made one triangle, but actually you wanted to make 30 triangles parameterized by 30 colors. And you've got to take your program and figure out, okay, what are the little combinatory bits that I've got to wrap my program in so that it's going to correctly map over this extra axis that I introduced? And it's easy if you're always adding those axes on in a sort of onion-nesting fashion, but sometimes you'll discover, oh wait, I have a whole bunch of differently sized triangles and a whole bunch of differently colored triangles, but I don't want to take the outer product of those two things. I actually want them to simultaneously interact somehow. And then it gets much harder, because you have to turn your program inside out to feed the data into the core of what you're doing in the right way. And what I eventually realized is this is kind of a broadcasting story, but applied not really to arrays, but to more general functional programming. And I think that's where the axis-order thing really becomes a very painful uphill battle, because you're having to take a program that was missing some axes, so to speak, and add them on.
And every time you do that, adding an axis at the beginning or adding an axis at the end will tinker with all the plumbing that you built your program out of, and you've got to compensate. And it feels like there's a missing level of abstraction that would just make all of that go away, so that you could target exactly the additional degrees of freedom that you want to introduce when you're updating your program. So my claim would be that that phenomenon is fairly common. It's not about having lots of axes. It's just that when you need to add or remove an axis, that cascading change can make it really painful to change a program that has axis order baked into it at a fundamental level.
00:40:55 [ML]
Yeah. And I think that use case probably wouldn't show up in the dyadic transpose test. You would hardly ever change the ordering of the axes, but at the same time, you really want to know the correspondence between axes, because some array might have the axes that are, you know, zero, one, and three in the total ordering, and another one's got one and two. So if you just have ordered axes, you have to think very hard to figure out how those correspond. The names would still definitely get you a lot there, but it wouldn't be something where you would reorder axes.
00:41:34 [TB]
Makes sense.
00:41:36 [BT]
So would you still need to declare the size of the arrays the way you would the axes?
00:41:41 [TB]
So I suppose that's a question for a particular implementation of a named-axis sort of system, right? I don't see why it would be necessary, is the short answer. So there are obviously costs, depending on how sophisticated your compiler is, to having runtime lengths. There's debugging costs, in that you might not know if two shapes are compatible in the way that you're using them. If you're combining two arrays that have a mismatching axis size, then you might want to know that at compile time. And if you have runtime sizes, you won't know that; you'll just get an error at runtime. But certainly size polymorphism is the easiest thing to achieve, so I think we can take that for granted; the story doesn't really change with named axes there. I think where it becomes really interesting is: what is the equivalent of rank polymorphism? [08] So, you know, you write a function for vectors and it'll automatically map over matrices, for example. And the semantics of that get quite a lot more interesting, and I would argue useful, with named axes. So with named axes, there's a sort of natural way to broadcast up, because you don't really have to make a choice. If an array has extra names that a particular function wasn't expecting, it's obvious what to do. You just map over those names, right? So what went from being a single result goes to an array's worth of results, whose axes are precisely the unexpected axis names that were present in the input. And I think that's so clearly a time saver to work with. But I think that's probably the tip of the iceberg. So I think there's way more obvious semantics around how to make the... maybe I should say something for everyone listening, which is that I've been using this term named axes so far.
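As a rough illustration of this "map over the unexpected names" semantics, here is a hypothetical toy sketch in Python/NumPy. The function names and the array-plus-names representation are invented for this example; they are not Tali's actual system:

```python
import numpy as np

def named_multiply(a, a_names, b, b_names):
    """Element-wise product where axes match by name; any name one array
    lacks is simply mapped over (broadcast). No positional conventions."""
    all_names = list(a_names) + [n for n in b_names if n not in a_names]

    def align(x, names):
        # Put the axes this array has into the shared order, then insert
        # length-1 axes for the names it lacks.
        present = [n for n in all_names if n in names]
        x = np.transpose(x, [names.index(n) for n in present])
        dims = iter(x.shape)
        return x.reshape([next(dims) if n in names else 1 for n in all_names])

    return align(a, a_names) * align(b, b_names), tuple(all_names)

def named_sum(x, names, name):
    """Aggregate away one axis by name, e.g. 'sum over row'."""
    i = names.index(name)
    return x.sum(axis=i), names[:i] + names[i + 1:]

m = np.array([[1, 2, 3], [4, 5, 6]])   # axes ("row", "col")
v = np.array([10, 20, 30])             # axis ("col",)

# v lacks the "row" name, so it is automatically mapped over rows:
prod, names = named_multiply(m, ("row", "col"), v, ("col",))

# Summing away the "row" axis by name, no axis numbers anywhere:
total, names2 = named_sum(prod, names, "row")
```

The caller never has to decide which way to broadcast: the result's axes are exactly the union of the input names, and the extra name was mapped over, which is the semantics described above.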
And my particular, weird way that I like to look at it: I'm a very visual person, and I really like making diagrams, and I really like color. So I suppose I want to point people again at the Rainbow Array blog post, which sets up the whole named-axis idea. It doesn't really make a super strong argument for named axes; it's just really a bespoke presentation of the underlying ideas, and I leave it to other people to make the arguments. And again, I'll shout out "Tensor Considered Harmful". It's a great place to start looking at why people in deep learning want to go past ordered axes. So I focus more on the diagrams and colors. So when I say named axes, internally I'm thinking colored axes; I'm thinking very visually about stuff. One of the reasons it's really nice to use color is that if you want to describe, say, summing a matrix across its rows, getting rid of the rows or the columns of a matrix by aggregating along that dimension, you can certainly write "sum" and then underneath you put a little letter or word that's the name of the axis that you're getting rid of. But it just looks so much nicer if you can just color the word "sum"; it's really obvious what it means. And it's also well suited for debugging, because our brains have got baked-in circuitry that can match colors. And so it's almost like attention. Speaking of which, I talked about transformers earlier, and transformers are built on attention: how do you focus attention flexibly on different parts of input data? With color, you see a diagram and it can have lots of different colors in it, and it all washes over you; you don't really know what you're looking at.
But as soon as you pick a color and you know that you want to focus on red, for example, it's like you've got a little filter in your head that will just mask out everything that isn't red. And you can instantly see where the other red parts of the diagram are and float to them.
00:45:40 [BT]
Well, the thing I was going to mention is, you and I both have that circuitry in our heads. I have pretty good color discernment, and you obviously do as well, but not everybody does. So the only suggestion I'd have on your rainbow is that in addition to color, you can darken or lighten the colors, so even a person who can't see a full range of color will be able to distinguish them. And if that doesn't work, you can also introduce subtle patterns, and that's enough to distinguish them. And that's the only thing I would say at all negative about your rainbow arrays: they're perfect for somebody like you or I who can see color, because you're right, our brains are wired to make use of that, but other people's brains aren't always wired that way. But everybody who's not visually challenged would be able to see patterns or shades as well. So that might be a way to make it a little bit more accessible. In the case of somebody who's visually challenged, I guess you'd have to look at different ways of distinguishing those. But again, just by naming the axes, you're actually doing that, right?
00:46:49 [TB]
That's true. No, that's a super fair point. And other people have made that point before about other color-based things that I've done, and I've been intending to solve that problem. You're right, you can use shading. You can also just pick colors that are still distinguishable for, I think they're called dichromats; there are different kinds of color blindness. But one thing is that time is a whole axis: in the world of JavaScript and interactivity on web pages and stuff, you can do blinking. This is a crazy thing, but imagine every element of a diagram blinking at a slightly different rate; the things that are in sync with each other will literally just stick out. You can just see what is blinking in sync with the thing that you're looking at. So, all these tools; it doesn't have to be color based, but I do like that. If you're looking at a picture and you want to correlate bits with each other, which is a big part of understanding how all the puzzle pieces fit together in some context, like a program or something, things like color, but not necessarily just color, are so much better than scanning text, where you've got to read pretty much every word to find the thing you're looking for, to see what things match. So, very long story, but that's why I'm so obsessed with color, and that's why I call it Rainbow Arrays.
00:48:14 [RP]
And we're going to get the sound version next, where this axis goes [sound], and that axis goes [sound], and then you can hear the chords of the array operations.
00:48:26 [BT]
Actually, just as you were saying that, the blinking, another way just occurred to me to do it: you put a legend down the side that's interactive, and when you hit a particular entry, all the corresponding elements would highlight. So you would see them the same way a person with full color perception would; they would jump out at you.
00:48:46 [ML]
Or just when you hover on a line that's in the actual diagram, you just say, well, what else uses this line and go there?
00:48:52 [TB]
Yeah, mouse over, and then everything else dims out. There are so many examples of that. So, I love making educational materials, and I'm just thinking that textbooks of the future are going to be like this. Think how often, to understand a particular diagram or image or equation, you've got to reference something from a chapter ago. It says "see 3.1" and you're like flip, flip, flip, flip. Oh, wait a minute. And, sorry, people listening can't see this: I'm miming flipping pages with my hands. And that's a huge problem, because it needlessly slows down your ability to correlate things and to really grok something. And it's so clear to me that we're at the dawn of a new era of visual understanding with interactive textbooks. "Interactive textbooks" sounds a bit corny, but it's educational technology that is waiting to be unlocked by clever UI design that works with how our brains work. Bring the equations with you: as you define things, there's no reason they can't float down the page as you scroll, so that you've always got the context that's relevant. An IDE for learning, basically. So I would love to spend a bit more time incorporating those into some of the blog posts I write, to make it easier to follow something that's got a lot of context, that's got definitions that are spread out, or connections that you should be aware of, footnotes. There's no reason we have to have static web pages that are emulating text on a page. It's crazy.
00:50:23 [BT]
Have you seen any of Bret Victor's work? [09]
00:50:25 [TB]
Yes. Extremely inspirational.
00:50:28 [CH]
Yeah. I'm so happy the conversation has gone this way, because it takes me to the main burning question that I had from last time that we never got to. So I'll read, hopefully this is not divulging too much, a sentence from one of Tali's emails in the back and forth leading up to this, when we were sharing resources. It says: there's so much of this kind of visual intuition that's missing from category theory exposition, especially when you embrace color as a visual aid. Which is basically what we're talking about here. That line made me think of two different quotes, which I went and looked up and had ready last time in ChatGPT, of course, because why use Google when you can just ask ChatGPT. One is from Bertrand Russell: "A good notation has a subtlety and suggestiveness which at times makes it almost like a live teacher." And then Richard Feynman, whose work I recall you saying last episode was a big inspiration, also has a quote on notation: "We could of course use any notation we want. Do not laugh at notations; invent them, they are powerful. In fact, mathematics is, to a large extent, invention of better notations." Anyway, those two quotes combined with your sentence resonate with me so much, especially when it comes to academic papers, specifically mathematical logic and mathematics papers. When I read these, you know, I'm not an academic in the sense of a PhD academic. I've read a lot of papers, but I'm not steeped in algebras, et cetera. And when you see all these different symbols, a lot of the time... like, even now, I consider myself a combinatory logic expert, but when I go and read the combinatory logic papers, the only reason I'm able to understand them is because I actually grasped the concepts before going to the papers.
And then when I see, oh, there's a B here or a B1, I already know that that is the monadic or dyadic, you know, before or after in BQN, or atop in APL; I already know the concept. And so, because of that, I'm able to make sense of the paper. But the academic papers are just awful in this regard. One of the best examples: I can't remember if it was an actuarial math class in university, but at some point you get introduced to the capital Sigma in mathematics. There are multiple different ways to write that symbol, and Microsoft Word has all the different variations, but typically it'll be like n at the top, i equals zero underneath, and then x subscript i. And it's just so overwhelming, like, what the hell. And at the end of the day, it's just a plus reduction. Like, if you show someone a list of numbers visually in a YouTube video and then you just go, oh, let's add them up together, kids know how to do that in, what, grade two or three? I don't actually know when you learn that. But this is some fancy notation that they've added and that you're learning in university. And at first I was kind of confused. I'm like, whoa, i equals zero, we've got n, we've got this subscript, what's going on here? And then they use that, maybe it was statistics, because they end up using that notation for standard deviation and variance and whatnot. Anyways, I'm interested to get your thoughts, because it seems like you're very big on the visual explanation of ideas. And I absolutely love that, because specifically with respect to category theory, I would love to be a category theory expert, but a lot of this stuff is just... even, you know, Bartosz Milewski has his great book; he's done a lot to lower the barrier to entry on that stuff.
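Conor's point that the capital Sigma is just a plus-reduction can be stated in a couple of lines of Python (a generic illustration):

```python
from functools import reduce
import operator

xs = [3, 1, 4, 1, 5]

# The textbook "capital Sigma, i = 0 to n, of x_i" is just a plus-reduction
# over the list: fold the numbers together with +.
total = sum(xs)

assert total == reduce(operator.add, xs) == 14
```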
And I think the same can be said about a lot of the array stuff, you know, common combinators, tacit programming. A lot of this stuff is actually quite, and I hate when people say this, like, oh, it's actually quite simple, because it clearly takes a lot to learn. But I'm completely on the same page with you that we're doing a disservice to humanity by making these ideas harder to grasp than they need to be. Because, you know, you need some kind of notation and way to communicate this stuff, but fixed academic papers where you define some algebra and then you go from there... it just seems like... anyways, interested to get your thoughts on everything I said there, because basically I'm just repeating to you what you already said via email.
00:54:25 [TB]
Well, thank you so much for saying this. It's a very juicy set of topics that I really care about too. And I should mention three cool connections. So first of all, yeah, I haven't really talked about category theory yet. It's a big part of what's actually going on in this blog series, because those array circuits that I show are visually depicting how you can think about arrays as functions from the index space, the key spaces as I call them, to value space, where the cells live. You can build circuits out of them, much like in a visual programming language: information is flowing along wires, and it's actually helpful to do so because (especially when you add color) it makes certain patterns of information sharing and flow really obvious. Here's a cool little connection that, to me, is so mind-blowing. So, Richard Feynman. One of the things that makes him famous ([he's a] brilliant scientist, obviously): he developed "Feynman diagrams", [10] which are ways of expressing interactions in quantum electrodynamics between subatomic particles. Squiggly lines and straight lines and so on [to] indicate photons and electrons and how they interact. It was very helpful because it turned what were otherwise very nasty equations (calculus, integrals and things) into diagrams where you could see what was going on without trying to parse a visually noisy equation. And you could enumerate these diagrams and focus on the structural, topological aspects: it was a breakthrough. Fast forward a little bit: different area, still mathematical. Roger Penrose, another great mathematician and physicist, wanted to do similar things for tensor calculations, which are a big part of general relativity, for example (that was his focus: gravity, general relativity).
And there you've got things that are very similar to array programming. You've got what can be represented as arrays with multiple axes. The way that you connect the indices of these arrays, which are really axes of these arrays, is how you set up calculations in tensor calculus, right? He designed a diagrammatic notation to capture the structural aspects of what was going on there, and it's called "Penrose tensor notation". I'm sure a lot of people have encountered it. It just looks very cool, very elegant. He used it to simplify a lot of what he was doing. That Penrose tensor notation [has] actually evolved into something more widely used now in a lot of different fields, called tensor networks. Tensor networks are one of the kinds of sub-diagrams that I use in the rainbow arrays [blog] post, because they specialize to a particular linear setup: you're doing a multiplication of several arrays and then eventually summing away an axis. That's the particular case they're good at; they simplify a lot of bookkeeping when you look at that.
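A minimal example of the tensor-network pattern described here (multiply arrays along a shared axis, then sum that axis away), sketched with NumPy's `einsum`, where the index letters play the role of axis names:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))

# Contract the shared axis j: multiply element-wise over j, then sum j away.
# The subscript string is the flattened form of a tiny tensor-network diagram.
C = np.einsum("ij,jk->ik", A, B)

# Contracting one shared axis between two 2-arrays recovers plain
# matrix multiplication:
assert np.allclose(C, A @ B)
```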
00:57:37 [TB]
Now, the great unifying thing is it turns out that both Feynman diagrams and Penrose tensor notation (and tensor networks more generally) are instances of a very abstract phenomenon that category theory describes. Category theory describes these things called monoidal categories; I won't describe what they are because it's too much to go into now and you also really do need pictures for this. But there's a diagrammatic notation for monoidal categories called "string diagrams". These string diagrams present these algebraic concepts in a visual way, and fascinatingly, Feynman diagrams, Penrose tensor notation, tensor networks and my array circuits are all examples of string diagrams for different choices of category. So pick the category: is it talking about physics? Is it talking about linear algebra? Is it talking about computer programs? And this uniform diagrammatic machinery of monoidal categories and string diagrams gives you a visual language for how to compute in that particular context. I think that's one of the reasons that it's really worth understanding category theory: it gives a universal language that you can customize for all of these different cases. It's been reinvented all these different times because it is so useful. So that's a little pitch [chuckles] for category theory and string diagrams. And [in] a future blog post, I'll try to really unpack that. It'll get at the category theory that's behind a lot of the way that I present [my] thinking about array programming. So thank you. That was really cool [chuckles].
00:59:14 [CH]
My one quick follow-up question: are there good resources? Like what did you use: one for learning category theory? [11] And two, is there a good resource for these monoidal category string diagrams. Obviously a lot of the category theory stuff out there is dense. Like, what would you recommend if someone's hearing this and does want to do a little deep dive [if] you've got their interest piqued?
00:59:36 [TB]
I think you mentioned Bartosz Milewski (I'm not sure if I'm pronouncing his name correctly); he does have a great book, "Category theory for Programmers", I think it's called. There's an online version of that. There's also a print version I've got, that's really nice. It's got cute little diagrams that feature little one-eyed, two-eyed, three-eyed pigs: very cute [laughs]. And he does use string diagrams there.
01:00:04 [CH]
Oh, interesting. So those diagrams in that book (it's been a while since I read it), but I don't recall them ever being called out as string diagrams, but maybe they were, and it didn't stick in my brain.
01:00:14 [TB]
Bizarrely string diagrams even apply generally to category theory itself. So what do you do with category theory? You play with functors and natural transformations and you can draw those as string diagrams. I think someone called it "the tar pit" or "primordial ooze". Category theory is so filled with examples of using categories to describe itself in different ways, like you can build categories out of machinery derived from other categories: it's like very recursive. It's super weird, but yeah, you can even use string diagrams to describe operations in category theory as a kind of uber example of the power of string diagrams.
01:00:55 [CH]
Yeah. I do recall in that book that you go from chapter to chapter and at one point, what was a morphism in a category is now like the category. You go up a level and the thing that was the category [laughs] ... [sentence left incomplete]. It's just like, you're reading it being like: all right, so now what was the focus before is now where we're headed anyway. The word "recursion" does not capture like how mind bending that stuff is [laughs].
01:01:20 [TB]
It gets even crazier because you can operate on two levels simultaneously: you can go up a level and stay in the bottom level. And then you need higher dimensional diagrams to reflect what's going on. So you need, instead of wires, you need sort of sheets and it becomes very topological. It's crazy stuff.
01:01:38 [CH]
Clearly you understood the category theory stuff a lot better than I did [laughs].
01:01:42 [TB]
I'm still learning. I'm still getting there, but I do think that it is hard to get started on, and that's partly because it's evolved in a very higgledy-piggledy way through history. So it didn't start as general as it is now. I think people didn't realize how powerful it was at first, or at least the number of people who realized how powerful it was, is very small. Now it needs to be broader. It's becoming broader. More people are getting interested in it. And some of that historical baggage does make it challenging to learn. Of course, it doesn't help that academics will often write for each other. And once you sort of know what's going on, there's really no incentive to explain things from the beginning, especially for a new audience. That's one of the reasons I love Bartosz Milewski's book so much is that it's for programmers. It will piggyback off all the intuition you have about like functions and input and output types of functions. Like all these ideas are already there; you already have got a good handle on them just from experience. And he shows how category theory illuminates topics like contravariance and covariance and how functors are very natural and how data types like lists are natural examples of functors and things like that. So what I'll say as well is that maybe watch this space. I've already got a little bit of material about string diagrams and how to understand programming. It's kind of along the lines of Bartosz, but like with my unique spin on things and color and that kind of thing. I'll be making some blog posts about that, and I'll definitely sort of publicize them when they're ready. Hopefully get as many people on board with the whole category theory program as I can.
01:03:27 [CH]
Well, definitely send us a note when you publish those and we will make sure to announce them because I am definitely interested. My guess is if I'm interested, there'll at least be a couple of other listeners, if not many of them.
01:03:39 [TB]
Awesome
01:03:40 [CH]
I feel like you had a couple more thoughts; I can't remember. Like I did my little ramble and then you said: I've got a bunch of thoughts. Did we go through them all? I want to make sure we didn't lose or pop a couple off the stack because we are past the hour. But if you've got a couple of things, we can still ... [sentence left incomplete].
01:03:53 [TB]
No. So I have a, maybe a mini-announcement, which is that I've substantially rewritten the "rainbow array" blog post. Just as a sort of advertisement: I tackle transformers. Transformers are obviously something that a lot of people are curious about; they are complicated. They do have a lot of axes floating around, so they're a good sort of test case for how rainbow circuits can help you see them in a different way. I wasn't particularly happy with that section before: it was a little bit overcomplex and there was actually an error. But I just wanted to do a little shout-out and say: if people want to take another stab at that, it's substantially different and, I think, simpler now, and it might be useful. There's also a fun little thing if you're a functional programmer and you've encountered closures and anonymous functions: there's like [a] really cool use of them in array programming. It turns out that nesting arrays (seeing matrices as vectors of vectors and vice versa, and collapsing vectors of vectors into matrices, and higher versions of that) are like very natural examples of closures and anonymous functions. I mean, that's not how you would necessarily implement them, but if you want to adopt this perspective that arrays are functions of a particular kind, then it is (just for me) quite interesting that closures and anonymous functions naturally pop out as being the tools you need to do that very common kind of operation. There's a little section on there that I added. I don't want to self-promote too much ... [sentence left incomplete].
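The nesting-as-closures idea can be sketched in a few lines of Python (a toy illustration, not taken from the blog post itself): if you view a matrix as a function of two indices, then nesting it into a vector of vectors is just currying, and each row you get back is an anonymous closure.

```python
# A 2x3 "matrix" viewed as a function of two indices.
m = [[1, 2, 3],
     [4, 5, 6]]

def matrix(i, j):
    return m[i][j]

# Nesting (matrix -> vector of vectors) is currying:
# fix the row index and return a closure over the column index.
def nest(f):
    return lambda i: (lambda j: f(i, j))

row = nest(matrix)(1)               # an anonymous function representing row 1
print([row(j) for j in range(3)])   # [4, 5, 6]

# Collapsing back (vector of vectors -> matrix) is uncurrying.
def unnest(g):
    return lambda i, j: g(i)(j)

print(unnest(nest(matrix))(0, 2))   # 3
```

The round trip `unnest(nest(matrix))` behaves exactly like the original `matrix`, which is the sense in which nesting and collapsing are inverse operations on arrays-as-functions.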
01:05:25 [CH]
That's what we're here for. I mean, when people come on podcasts and they're like: oh, I don't want to ... And it's like, one, we're here to have a conversation. But I think that's one of the best parts of podcasts that I enjoy. There are some [in which] you'll listen to a 60-minute conversation and they'll talk about 15 different things. And then you'll go to the show notes and there's like two links: one is to their Twitter bio, and then one is to the main topic that they talked about. But every single YouTube video or whatever they mentioned, you have to go hunt for yourself. So on ArrayCast, we definitely try to accumulate all the links so that if folks do go searching, they can just find the timestamp, click, and they're good to go. Maybe as a final thought, I was thinking about this at one point (but then we started talking about the notation and visualization and how important that is these days); the thought that popped into my head is that I'm a huge fan of tacit programming [12] as most folks know. And maybe the term tacit programming is more like a specialization, in the sense that [when] we talk about tacit programming, it refers to tacit arguments. But at one point when you were talking, Tali, I was thinking (and you were making a sort of comparison to procedural programming): the thought popped into my head: yeah, it's crazy to think that at some point the subroutine was "invented" (quote unquote). And there was a time before that where you just had jump statements and go-to-this-line-of-the-program. And then someone was like: what if we just bundle this stuff? Does it really matter that it's at line 73? Probably would make things better. And everyone's like: oh, that's a pretty good idea. And now you jump forward and there's tacit versus explicit programming, which mentions arguments, but it's almost like kind of what you're advocating for. And that hasn't really been said: we in array programming languages kind of have tacit axes.
Not entirely, because when you have the rank operator, you can refer to these with a rank, but in general, in the leading axis theorem world, you have these built-in operators that, a lot of the time, you're just sort of using on axes that are tacit. You know that if you use a "+/", that's going to operate on whatever your rows or columns are, depending on the language that you're in. And what you're advocating for is basically like explicit axes. The reason I bring that up is [that for] a lot of folks, when I have conversations [with them] about tacit versus explicit (even though I love tacit and I think there's an elegance in it), it is very true that a lot of the time, when the size of your program grows and you're trying to write some tacit expression, it becomes harmful to the readability of the program when you try and chain all that stuff up. The amount of state and the kind of mini abstract syntax tree that you have to build in your head, it's too much. Like, it's a fun exercise, but really the best tacit is [in] small little examples. When you get to a larger example, it's nice to name your variables so that [each thing] has a name. I'm not sure if you have any thoughts on that, but [if we're] headed towards a sort of explicit-axes world, then (the same way that explicit functions improve readability at a certain complexity level) probably the same thing is true about these explicit axes. Anyways, just a thought I had in my head at one point.
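The tacit-versus-explicit-axes contrast can be sketched in Python (the named-axis API below is entirely hypothetical, just to make the point concrete): a leading-axis reduction like "+/" leaves the axis implicit, while a named-axis version spells out which axis disappears.

```python
# Implicit ("tacit") axis: the reduction collapses the leading axis,
# whatever it happens to be, like APL's +/ under the leading axis model.
table = [[1, 2, 3],
         [4, 5, 6]]

def sum_leading(a):
    # fold + down the columns, i.e. over the leading (row) axis
    return [sum(col) for col in zip(*a)]

print(sum_leading(table))           # [5, 7, 9]

# Explicit named axes (hypothetical design): the array carries axis
# names, and the reduction names the axis it removes.
named = {"axes": ("row", "col"), "data": table}

def sum_over(a, axis):
    k = a["axes"].index(axis)
    # bring the named axis to the front, then reduce over it
    data = a["data"] if k == 0 else list(map(list, zip(*a["data"])))
    return {"axes": tuple(n for n in a["axes"] if n != axis),
            "data": [sum(col) for col in zip(*data)]}

print(sum_over(named, "row"))       # {'axes': ('col',), 'data': [5, 7, 9]}
print(sum_over(named, "col"))       # {'axes': ('row',), 'data': [6, 15]}
```

The tacit form is terser, but the named form documents in the source which axis every reduction consumes, which is the readability trade-off being discussed.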
01:08:35 [TB]
Everyone has their own preferences, I suppose, but I've definitely noticed for myself, as I've kind of had more and more experience of reading my own old code [chuckles], I've become extremely committed to very long and descriptive names. Like for functions, right? It's like the more breadcrumbs you can leave for future you or even worse someone else to reconstruct what you were thinking, the better. Yeah, it can seem like a step backwards to like increase the verbosity of everything that you're doing. Especially when you're designing it's fun to solve these little puzzles and you're proud of the little state machine that you built in your head that models what your program is doing. You're proud of all the little corners that you could cut or how well you could play the Tetris, right? But playing it in reverse, when you're coming back and you've lost that context, is like twice as painful as whatever pleasure you gained from the exercise.
01:09:36 [CH]
We've all been there, right?
01:09:38 [ML]
I will say, though, one concern I have about the named axes model (and this is maybe not quite the same as with tacit programming), but the concern is if you have intermediate values. If you're writing "A plus B times C", you don't name "B times C". So you don't just name everything. You can (the programming language allows it), but nobody at all does that. So one concern I have about named axes is when you have an axis that's computed, that is more or less obvious from context, but two axes need to line up and they don't. So one thing that I might do in parsing a lot is: I have my string that I've kind of implicitly split up into a bunch of segments. One way to indicate this (if the segments are not empty) is that I have a boolean array with a one at the beginning of each segment. Maybe also I want to work with the end of each segment, so I care about the beginning and the end. To get the end of each segment, I'm going to shift a one into the right of my boolean array. Now I can get either a character from the beginning of each segment by filtering by the first array, or a character from the end by filtering by the second. Normally it makes sense that if you do an operation, you can say: there's an axis associated with this; I don't know what it is. For filtering, the axis is not the same as either argument's: it's based on the values in the filtering argument, this boolean mask. The program can't even know. What I want to do here is to say: the things from the beginning of each segment and the end of each segment have the same domain, because they correspond to the same segments. But when I'm writing the program, you can't even know that they're the same length, because with my mask of segments, if you can't prove that it starts with a one, then when you shift a one in at the end, you may have actually increased the total of it. That seems kind of tough.
You have to assert to the program that when you filter by these two masks, you get the same axes. That seems like one way that named axes can really slow you down and add something that really is obvious when you read it, but you're forced to specify it in the program. And that makes programming kind of harder.
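Marshall's segment example can be sketched in Python (an illustrative reconstruction with made-up data, not code from the episode):

```python
# Segments "ab" | "cde" | "f", marked by a boolean "starts" mask
# with a one at the first character of each segment.
s      = list("abcdef")
starts = [1, 0, 1, 0, 0, 1]

# Shift a one in from the right to mark the last character of each segment.
ends = starts[1:] + [1]

first_chars = [c for c, m in zip(s, starts) if m]
last_chars  = [c for c, m in zip(s, ends)   if m]

print(first_chars)   # ['a', 'c', 'f']
print(last_chars)    # ['b', 'e', 'f']

# The two filtered results line up segment by segment, but only because
# starts[0] == 1. That invariant lives in the programmer's head: nothing
# a named-axis checker can see proves sum(starts) == sum(ends).
assert sum(starts) == sum(ends)
```

If `starts` did not begin with a one, `ends` would have one more set bit than `starts`, and the two filtered arrays would silently stop corresponding, which is exactly the alignment a named-axis system would force you to assert.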
01:12:05 [TB]
Hmm. I'll say one thing. I didn't fully understand the example; it sounds like we'd have to screen-share to see. But I'm sure you're right. One thing I'll say is, it's a bit like I'm presenting named axes as everything, [which] is a little bit of a (what do you call it?) figurehead. Because there are clearly cases where there are going to be exactly two or exactly three things, and you don't need to name them. Like, the number is a good enough name. In fact, it's the right kind of name. So we should really have a blend [laughs].
01:12:45 [RP]
I will say, I have been assuming the whole time that when you have just a list or a vector, you're going to be just writing "sum" and expecting the only thing to happen, you know?
01:12:55 [TB]
Yeah. [Tali and Rich laugh]
01:12:56 [ML]
Well, yeah. Usually that's fine, because if you infer axes ... [sentence left incomplete]. I mean, you say: this vector has some axis that's inferred. When you sum it, the axis disappears, and it doesn't even matter what it was. Where you run into trouble is if you want the correspondence between axes. So if you have axes [when] you're writing the program and you think they correspond, but you let your interpreter infer and it fails to infer that they correspond, then it's going to do an outer product when you were expecting a "zipped" or an "each" product. Then you could end up with a result that you don't expect at all.
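A toy Python illustration of this failure mode (the named-axis `add` here is hypothetical): when two axis names match, the operation zips; when they fail to match, you silently get an outer product instead.

```python
# Hypothetical named-axis addition: matching names zip elementwise,
# distinct names broadcast into an outer product.
def add(a, b):
    ax_a, xs = a
    ax_b, ys = b
    if ax_a == ax_b:                       # axes correspond: zip
        return (ax_a, [x + y for x, y in zip(xs, ys)])
    # axes differ: outer product over both axes
    return ((ax_a, ax_b), [[x + y for y in ys] for x in xs])

u = ("i", [1, 2])
v = ("i", [10, 20])
w = ("j", [10, 20])       # same data as v, but the inferred name differs

print(add(u, v))   # ('i', [11, 22])                  the zip you expected
print(add(u, w))   # (('i', 'j'), [[11, 21], [12, 22]])  surprise outer product
```

The two calls differ only in an inferred axis name, yet one returns a vector and the other a matrix, which is the "result you don't expect at all".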
01:13:32 [TB]
Yeah, exactly. So that's a good example where it feels harmless, but maybe there is some benefit to that extra [explicitness] as well (I know there's a lot of us and maybe some people need to go). But one fun thing (I have no idea how to do it, but I feel like there must be a way of doing it) is the way that the names of these axes, like the conventions for namespacing, [interact with] the namespacing that you want to impose as regards function namespaces ... [sentence left incomplete]. It feels like that detail is going to be so important to figure out, because arrays themselves will have names because they live in variables, right? But you can see variables as fields of a record of whatever namespace your file is. Or if they're local variables in a function, that function is its own little namespace, which is kind of like a record, and records are kind of like arrays. So everything blends into this morass, and there must be some, I'm hoping, deeply powerful, simplifying principle that makes it all roughly the same, so that not only are arrays functions, but also functions are effectively arrays and namespaces are arrays and it's all arrays and evaluation is all one thing [chuckles]. That would be super cool, but I have no idea what that looks like.
01:14:55 [ML]
Yeah. And the direction I've kind of headed in ... [sentence left incomplete]. I really don't like the way that a lot of frameworks people are working on (which I haven't worked with; maybe it works out), but I don't like the way that they attach the names to the data instead of to the source code; a lexical versus dynamic kind of thing. The direction that I'm trying to head in, at least, is that axes are purely a compile-time concept: these names are defined lexically. So you'll declare in the program: I have this variable at this time and its axes are A, B and C. That's an aspect of your source code, which is checked like a type would be, and then compiled out. That seems the most promising to me, but having not tested any of this out, it's very hard to say.
01:15:44 [TB]
I want to mention that reminds me of something that kind of really blew my mind when I thought about it deep enough, which is with Zig: [13] Zig's made this observation that if you erase the distinction between compile time and runtime [and] rather you make them just stages that have a convention attached to them, you're going to produce this executable object, but as much of your program as you want can run during compilation and produce types that you then use in the next stage of compilation, right? That is so different from the way that people achieve polymorphism in traditional languages. And it's so cool [chuckles]. It's so bizarre. It's like, let's just do staging. You don't need a whole different language. You think about Rust which has got, you know ... [sentence left incomplete]
01:16:33 [ML]
Well, I have to object here because I have [chuckles] (maybe you should listen to my Singeli episode), but I made a language where the compile time and the runtime *are* different languages. I think like Zig and D are essentially doing the same thing, but they're using the same syntax for these two different languages and the semantics are not actually the same. So I think you run into a lot of trouble by trying to make them the same language.
01:16:58 [TB]
I kind of agree, like, it didn't go far enough. There should be ... [sentence left incomplete].
01:17:01 [ML]
No, I don't think there is a far enough [laughs].
01:17:05 [TB]
Okay [laughs]. Maybe we completely disagree. We should perhaps continue this discussion offline because it's probably quite off topic, but it's really interesting. I'm very curious to hear that opinion.
01:17:14 [CH]
Alright. Well, I guess we're just going to leave our listeners hanging then. Will there be a part three where we stop talking about array languages and start talking about compile-time versus runtime types? I mean, I will say it's very novel what Zig has done, especially coming from ... [sentence left incomplete]. Every C++ programmer on their learning path at some point realizes that template arguments are just like a different type of argument to a function [chuckles]. Like, you can write the same types of functions, and a lot of the time, if you know what the value of your integer type is, you don't need to pass that in as a runtime argument; you can pass it in at compile time. But the syntax for that is: chevron brackets (the "<>" in front of the parentheses). Really, when you think about it, you're calling a function that has compile-time arguments and runtime arguments, and the syntax for that is ([assuming] the name of your function is f): "f<however many compile-time arguments you have>(however many runtime arguments you have)". A lot of the time you can just start messing around and shifting your runtime to compile time, as long as it's like an integer constant that you know, and that comes up even in modern design. We have new algorithms that were introduced in C++23 where there's a slide() function and there is a (what is it called?) adjacent(). Basically these are the same function. These are kind of like the n-wise reductions in APL, if you just want a sliding window over your array and then you want to do something to that. slide() passes that window width as a runtime argument, but adjacent() passes it as a compile-time argument (there's a couple of other slight differences, but basically they're the same semantically). The difference is just that sometimes you're going to know what this window is at compile time, which is going to lead to an increase in performance because you can do a bunch of stuff at compile time now.
Whereas in Zig they don't have a delineation between these two and you can just write your code. And I haven't done any Zig programming, but I've seen some Zig code and it's like: huh! As a C++ programmer, you see that and we have such a hard cut in between [chuckles]. Like they're literally two different brackets next to each other. It's like: here's your compile time list; here's your runtime list. It's a very novel thing, or maybe not that novel, but it's just very different.
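The runtime-versus-compile-time-argument distinction can be loosely mimicked in Python with a closure that fixes the window width up front (only an analogy, since Python has no real compile-time stage; compare C++23's runtime-width slide() with the compile-time-width adjacent<N>()):

```python
# "Staged" sliding-window sum: the width is fixed when the worker
# function is built, and only the data varies per call.
def make_windowed_sum(width):          # the "compile-time" stage
    def windowed_sum(xs):              # the "runtime" stage
        return [sum(xs[i:i + width]) for i in range(len(xs) - width + 1)]
    return windowed_sum

pairwise = make_windowed_sum(2)        # analogous to adjacent<2>()
print(pairwise([1, 2, 3, 4]))          # [3, 5, 7]

def slide_sum(xs, width):              # width as a runtime argument,
    return make_windowed_sum(width)(xs)   # analogous to slide(n)

print(slide_sum([1, 2, 3, 4], 3))      # [6, 9]
```

A real staged or templated language can specialize and optimize the fixed-width version ahead of time; the closure here only captures the shape of the idea, not the performance benefit.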
01:19:36 [ML]
Yeah. I think D does something pretty similar.
01:19:38 [CH]
Yeah. And it's kind of nice in a way, if you can get everything you can get in C++ and not have to think about it as a programmer, like why would we do that? But obviously there's trade-offs and stuff. So I'm not sure if you want to have any last comment or do we just tease the listener? We'll end it there. And maybe we'll do a [chuckles] Zig, C++, Rust compile time ... [sentence left incomplete]. What was the name of ... [sentence left incomplete] Singeli was the name of your language, right Marshall? [Marshall agrees]. We'll do some compile time versus runtime arguments and what's the state of the art and the best option for doing this kind of stuff.
01:20:14 [TB]
I'll definitely check Singeli out; that sounds very interesting. Thanks for that. Yeah, you should still find some excuse to have Andrew Kelley on to discuss Zig. I don't know. I suppose it's a bit off topic for array programming, but look: he's written a data-oriented programming book, if I'm not mistaken. And there's also some cool stuff in there with, like, struct-of-arrays and array-of-structs: that automatic transformation that Zig makes really easy. It's essentially built in; well, it's part of the standard library. So there might be a connection to array programming, sort of, if you squint. But anyway, whatever.
01:20:48 [CH]
Yeah. I mean, I'm not sure if we'll have him on here, but he is actually at the very top of the list of folks that we plan on interviewing on my other podcast: ADSP. We [were] actually supposed to do it in October, but then we ended up interviewing Richard Feldman instead, who is working on the Roc programming language. But yeah, both his sort of language ideas and also, I love his sort of (I don't know what you call them) world philosophy, and just sort of the approach that he's taken to building an organization behind Zig I think is also fantastic. But anyways, with that, whether it's here on ArrayCast or on my other podcast, ADSP, at some point we will get Andrew Kelley on. For those of you that don't know, he's the individual that made the Zig language. With that, you know, I thought we were going to land this only 10 minutes past, but now it's half an hour past. I don't know what makes us more consistent [laughs]. But I will throw it over to Bob, who will mention where you can reach us.
01:21:46 [BT]
You can reach us at contact@arraycast.com [14] and we welcome your comments. The last episode we had with Tali had a lot of really good comments, and we look forward to more. And of course, check out Tali's rainbow arrays, because it really is quite a neat thing to look at. Well, you're coming to the end of this episode [laughs]; it would have been really great if you'd looked at it before the episode, but now that you've heard it, maybe you look at it now and you get a better sense of what we were talking about. And other than that, I think right now Bryce would be losing his mind if you did to ArrayCast [laughs] what was just done through ADSP. Oh, he just goes nuts.
01:22:28 [CH]
Oh, yeah, he always gets upset whenever I mention ArrayCast [chuckles].
01:22:32 [BT]
ADSP is Conor's other podcast, and we're not afraid to promote it here. So we'll do that and that's about all I got.
01:22:42 [CH]
And don't tell Bryce, but I think this is by far the better podcast. I think [with] our other one, it's less of a niche topic, so there's more listeners. But I think the quality of the conversations we have here are ... [sentence left incomplete]. For example, the last two episodes with you, Tali, I mean, this has been fantastic. Did I think we were going to be going into (I'm not even going to remember it) the Feynman diagrams and the Roger Penrose tensor notation, and that these are all in the same category (ha, category theory)? It's just fantastic. I never know where these conversations are going to go. I always come away with more things to go and read and learn and do. It's fantastic. So yeah, once again, thank you for spending, you know, three hours at this point with us. And maybe sometime in the future, after several category theory blogs and some developments, we'll have you back and we can chat about everything you've been working on from now until that point in time.
01:23:32 [TB]
Thank you so much. I had a great time.
01:23:33 [CH]
And with that, we will say Happy Array programming!
01:23:37 [Everyone]
Happy array programming.
[Music]