Inside Tech Comm with Zohra Mutabanna

S6E4 Audio Synthesis in AI: Breaking Down Barriers

Zohra Mutabanna Season 6 Episode 4

Manny graces our show once more to share insights on audio synthesis, a technology that has evolved far beyond robotic voices into sophisticated AI systems. It has made documentation more accessible and enhanced technical communication workflows. Manny and I discuss practical applications, implementation strategies, and ethical considerations for incorporating audio synthesis into documentation.

Key takeaways:

  • What is audio synthesis?
  • How the "curb cut effect" of audio options benefits everyone, not just those with disabilities
  • How can you implement audio synthesis through CI/CD pipelines for docs-as-code or integrated into CMS publication workflows?
  • How should you prepare content with alternative descriptions for visual elements like code blocks and diagrams for audio synthesis?
  • Voice cloning and the careful ethical considerations needed to avoid misuse
  • What are some proprietary and open-source options that provide audio libraries?

Guest Bio

A tech writer by day and engineer by night, Manny Silva is Head of Docs at Skyflow, codifier of Docs as Tests, and the creator of Doc Detective. He’s passionate about intuitive and scalable developer experiences and likes diving into the deep end as the 0th developer.

Show Credits

  • Intro and outro music - Az
  • Audio engineer - RJ Basilio



Zohra:

Hello folks, welcome to another season of Inside Tech Comm with Zohra Mutabanna. In season six, we unpack how generative AI works and what it means for your tech comm workflow. From core concepts to practical use, we're gonna go under the hood. It's time to adapt, create, and thrive with AI. Let's dive in. Hello listeners. Welcome to another episode of Inside Tech Comm with none other than Manny. Hi Manny, how are you doing this morning?

Manny:

I'm doing well today. How are you doing, Zohra?

Zohra:

I'm doing great. Manny, this is a follow-up conversation, so to speak, to our first episode. And you've introduced yourself several times, so we are going to jump right into our questions. But for our audience's sake, if somebody's listening to this for the first time, a quick background about yourself would be good.

Manny:

Sure. So I'm Manny Silva, head of documentation at Skyflow, author of Docs as Tests, a strategy for resilient technical documentation, and creator and maintainer of Doc Detective, an open source toolkit for testing your documentation. I like to dabble in all shiny new things, and so, yes, I do a lot of experimentation with AI among my other hobbies.

Zohra:

And so here we are with Manny to pick his brain on all things AI as we do a deep dive into what's happening behind the scenes, under the hood, as we prepare to integrate generative AI into our technical communication discipline. The last time we met, Manny, we talked about agents, what agents are, and in our brief conversation before that interview you had suggested audio synthesis as a topic. And I thought, okay, let's talk about audio synthesis, because we are familiar with systems that integrate audio, but how does that inform, or I would say direct, where we can take that technology in technical communication? So to level set for our audience, Manny, how about explaining in simple terms what audio synthesis is and how it fits into the landscape of AI technologies?

Manny:

Sure thing, Zohra. Audio synthesis is a more modern term for something that most of us are more familiar with than we think. It's text to speech. That's all it is. There are a few other flavors, but at its root, we've all interacted with systems that have some sort of computer-generated vocalization. Congratulations, that's a form of audio synthesis. But it does expand beyond that. There are some systems that specialize in generating sound effects. There are some systems that specialize in ingesting speech and outputting speech, so instead of a text-to-speech system, it's a speech-to-speech system. And then it gets a bit more complicated when you have systems that are what's called multimodal, where they can accept multiple different kinds of inputs and create multiple different kinds of outputs depending on what you're asking and how you're prompting. But at its simplest, audio synthesis is the process of generating audio via whatever system you're using.
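To make that concrete, here is a minimal text-to-speech sketch in Python using the offline pyttsx3 library; the library choice, rate setting, and file name are illustrative assumptions, not a recommendation of any particular tool.

```python
# Minimal text-to-speech sketch using the offline pyttsx3 library.
# The library choice and settings here are illustrative assumptions.
import pyttsx3

engine = pyttsx3.init()            # initialize the platform's TTS backend
engine.setProperty("rate", 175)    # speaking rate in words per minute

text = "Audio synthesis, at its simplest, is generating audio from text."
engine.say(text)                   # queue the utterance
engine.runAndWait()                # speak it through the default audio device

# You can also render to a file instead of the speakers:
engine.save_to_file(text, "sample.wav")
engine.runAndWait()
```

The same idea scales up to the neural systems Manny describes; only the quality of the voice and the way the model is built differ.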

Zohra:

And if I've got this right, the daily tools that we interact with, such as Siri and Alexa, do offer audio synthesis, text-to-speech synthesis.

Manny:

100%. Even going back to, and this is dating myself, the dedicated in-car GPS systems that would say "turn left," the ones that have been completely replaced by Google Maps for a long time now. Yeah, that was audio synthesis to a degree. Very simplistic compared to what we're dealing with today, but roots of the same technology.

Zohra:

Gotcha. Which means this technology has been around for a long, long time, since I'm dating myself as well. And I'm actually pretty happy to date myself to show how the technology has evolved from day one, I would say, or day zero of our first interaction with a clunky tool, to where it is today, where you can literally have a conversation with it.

Manny:

Yes, and part of that depends on the underlying technology, because something like those old GPS devices, you couldn't have a conversation with. Exactly. Those weren't based on the same tech stack that the modern audio synthesis tools have. They literally took some text and then mapped it, using a series of rules, to whatever audio was output, which is why it sounded so robotic and repetitive. But the more modern systems use the same sorts of architectures that the large language models use under the hood. There are completely different architectures that some of these tools use too, but they're trained in a very similar way. And some of these multimodal models are based on the large language models that we have become much more familiar with over the last few years. That means that any interaction you could have with a large language model, you can have verbally with one that can do speech understanding and speech synthesis. And you can see this already today. ChatGPT has it, I believe they call it interactive mode. Claude has it, Gemini has it. This is becoming more and more of a staple in these technologies. There are also open source versions of it too, but we can get into that later.

Zohra:

Perfect. And I will make a note of the open source options, because I do want to touch upon those here. Again, I like to make sure that we are talking in the context of generative AI here. Yes. Right. Got it. Since our first episode was about AI agents, my follow-up question is: how do AI agents use audio synthesis today? We have talked about the common real-world examples already, but again, at a high level, how are AI agents using audio synthesis?

Manny:

Sure. So let me give you a personal example. A few weeks ago, I was struggling with a problem that I couldn't quite work out. I was caught up in my own head, and there was no one immediately around who I could call or speak to to work through it. So I went out for a walk, but that didn't do the trick. So I put in an earbud, pulled up my AI tool of choice, and just started talking to it through my earbud while I was on my walk. I was like, hey, I'm dealing with this really tricky engineering problem, I have these considerations, help walk me through this. And it was like doing the Socratic method on my own, because I had this artificial tool to help me through it. So this is an extremely real-world example, and that was my experience. I've done it on multiple occasions. It's like the engineering equivalent of rubber ducking, where an engineer has a rubber duck on their desk that they talk to to work through problems, except this time the rubber duck can respond and give me suggestions.

Zohra:

Such a great analogy, the rubber duck, and something personal that I would like to share. When I go to the gym, we are training for a marathon, and we have rubber ducks on our treadmills just to motivate us, something funny. It doesn't talk to me, of course, but it brings in a sense of humor. But you mentioned something about the Socratic method. What is that?

Manny:

So the Socratic method is the preferred means by which the ancient Greek philosopher Socrates would teach his students. Instead of having a lecture hall where he would speak to everyone who was assembled, Socrates did more one-on-one or small group tutelage sessions where he would ask people questions and prompt them to do the critical thinking to arrive at their own conclusions. He would never give them the answer. He would lead them to their own answer. And so when I am interacting with generative AI, particularly when I'm actively having a conversation, whether that's via text or speech, I am not looking for it to tell me what to do, because these tools are unreliable. We should not be following their guidance. But if I ask it a question, it can offer me suggestions. I can interrogate its suggestions. It can come back at me with questions: well, you said this was a consideration, and that's not going to be met by the line of thought you're following right now. So it's more of a back and forth conversation.

Zohra:

I want to, of course, go back to the example that you shared where you had a conversation, quote unquote, with your AI tool of choice. At the end of it, did it validate what you already knew, or did it suggest new ideas that you could then go and vet with another human?

Manny:

In that particular example, I had ideas for how to proceed, but I couldn't figure out which would be the optimal path. And so it was helping me break down the pros and cons and the engineering challenges inherent in what I was dealing with. By the end of the conversation, I was reasonably confident I had a path forward. And indeed, I did. When I had the opportunity to run it by somebody, I had already written code, so I had proven to myself that the strategy did work. And when I talked through the logic with somebody else, they were like, yeah, that's sound. So that was validated. But by the end of my interaction with the AI itself, I was already reasonably sure I knew which route I was going to go.

Zohra:

Awesome. And again, something different: my brother was looking for another opportunity, and he was trying to do some mock interviews. He started with his AI tool of choice, in this scenario ChatGPT. He created a prompt, asked ChatGPT to be the hiring manager, provided all the details, and gave it some scope with very tight rules on how to conduct the interview, just to get him started. And as I was watching him do that, I realized how I almost forgot that this is not a human. It's an AI tool. You become so comfortable with it. And I bring up this question for myself as well, since you have interacted with AI tools of your choice: how do we keep these boundaries? Because I tend to sometimes get carried away, and I read these scary stories about the influence AI can have on you. What are some guidelines you would recommend?

Manny:

Keep it professional. Don't have a conversation with this thing that you wouldn't have with one of your colleagues. And I'm not talking about the conversations you have after work when you go out to a bar and have a few drinks. No, don't have those with the AI. If you're in a workplace setting, if you are trying to figure out a problem, then yes, go ahead, use the AI. But be very careful what you discuss with any of these things, regardless of whether it's via text or speech or any other form of interaction, because, and this is getting a little off topic, but as of now, ChatGPT has to keep logs of every conversation that is had because of ongoing lawsuits. OpenAI was ordered by a U.S. federal court to retain every single conversation and not destroy them, even if the user clicks delete. And not only that, there are lots of instances where people have shared conversations with other people. That little share button that ChatGPT has, guess what? That means the entire conversation is publicly available. And many of those, and when I say many, I mean hundreds of thousands of these publicly available conversations, got indexed by Google and other search providers. A lot of them got backed up to the Internet Archive. So even after they were de-indexed, they were still archived. And there's no getting rid of them once they're archived. It's the same truism that has been around since the dawn of the internet: if you put something on the internet, it's there forever. Maybe not in the form it originally was or in the location it originally was, but it's there forever. So these tools are not our friends. These tools are at best our co-pilots. They are just that, they're tools. And we should not be discussing private or personal matters with them, for data privacy concerns and also for our emotional and mental well-being. There is a very worrying trend of people using AI tools as therapists, especially when it comes to speech synthesis and generation, because, like you said, it's easy to forget that it's not human, especially when it is interacting in more human ways. And as these tools get more powerful, as they get more emotive, it's going to be even harder to remember that. So you have to have a really clear line drawn in the sand that you don't discuss anything sensitive or personal with these things, because they aren't meant for that. They weren't designed for that. Using them for those purposes can lead you down very difficult and painful paths. And we've seen that happen already. This is not a hypothetical.

Zohra:

Right. And I'm glad that we segued into this area, because it is extremely important. It doesn't matter whether it is text-to-speech or just text interaction, keep this in the back of your mind at all times. As you start using these tools in a work environment, there may be guardrails already protecting you. But if your company is not set up for that, it's critical to be mindful of what professional data you are sharing. As careful as you have to be with your personal data, be careful with professional data as well, just in case you're leaking out something that shouldn't be leaked out. So be extremely careful. Thanks, Manny. It was very important to have this conversation and bring this dimension back into the picture: these are not humans that you're dealing with, and how we interact with them becomes even more critical. We will become great at using them, but we shouldn't forget that we are the humans and we are the drivers. I always bring this back as a reminder for myself as well. In the context of audio and generative AI, we hear the term multimodal.

Manny:

Yes.

Zohra:

And I think you kind of used it as well at the start of the conversation. So let's define what that is as well.

Manny:

Sure. So multimodal means that the AI model can operate in multiple modalities, modalities being text, video, audio, and images. I'm sure there are going to be others that people come up with, but by and large it's those four or some variation thereof. When a model is multimodal, it can either accept two or more kinds of input, like ChatGPT, which can accept images, text, and, with interactive mode, audio, or it can output multiple different modalities. It can output text, it can output speech, it can output video, like when we get into Google's Veo 3. That is a video model, but it's text or image in and video out, and that makes it inherently multimodal. And that's all it means. Each model is going to be a little bit different about what it supports at either end of the interaction. And just like large language models generally can be biased based on their training data, guess what? These are still large language models, and now their training data isn't just text, it's video, it's audio, it's images. All of these different kinds of training data can bias the model in different ways. So going back to Google's Veo 3, if you really look into it, most of the outputs, or at least the highest quality outputs, look like YouTubers. It looks like somebody looking at a camera and talking to the camera. Maybe they're outside with a selfie stick, maybe they're sitting at a desk or something like that. But guess what? The suspicion is that Google used a whole lot of YouTube training data in Veo 3, and because most of what the model was trained on involved people looking at a camera and talking to it, that's what it can generate best. It has a much harder time generating things like sports events or vehicles or whatever else, because there was less training data for them. Now, I do want to state that I do not know for certain that they used YouTube training data. That is the generally held assumption. There's been no statement either way, as far as I'm aware. So Google, don't at me. But that is the general understanding. And overall, that's what multimodal is. It's just multiple modalities, either in or out, or both.

Zohra:

And again, when we give these examples, it's what we, or rather I, have gleaned from the media. So please don't take my word for it; go check it out yourself, right? We are not claiming one way or another. This is just what we've read in the media, and we try to verify as much as we can and back ourselves up. But without any confirmation, we are not trying to point fingers here. It's just a discussion and examples of what can happen in the real world. Regardless, you want to be careful. That, I think, is our message at this point.

Manny:

Even with all of this, even with the concerns around the modern version of this tech, it can still be wildly useful, and like any technology it has a lot of good sides to it. I don't want to scare people away from it inherently. If, for example, someone is visually impaired and they have a hard time with text, then guess what? If the same model that can output the text can instead output speech, their life is immediately and dramatically improved in that interaction. If someone is not necessarily visually impaired, but they're in a place where they simply can't read in that moment, then these systems can still help tremendously. For example, the Google smart home systems, the Nest speakers, I found them invaluable when my kids were really little, because I'd be bouncing a kid and trying to keep everybody calm. I didn't have a chance to go grab my phone to do whatever, but I was able to give it a command and it was able to do the thing and give me a vocal, audible response. I did not have any sort of disability or consideration in that way generally, but in that moment I did. I was inhibited in what I could do, and this sort of tech was still a lifesaver. There is a lot of upside to this, and there's also an upside from a more technical documentation perspective, in that not everyone is going to want to read our content or has the time to read our content. If somebody wants to listen to it, like you may have seen, many news sites have a "listen to this article" button. Awesome. That makes life easy. If somebody wants to listen to a piece of your docs because they've been swamped and they're going out on a walk, or they've just been looking at a screen way too much and they want to take a little bit of the edge off that way, then they can listen to it instead. It makes life a lot easier. Also, one of the things that I know I at least have heard many requests for over the years is more video in our content, one way or another. But there are a lot of us who either don't like our voices, who are not confident in our vocal performances, or for whatever other reason don't want to have our literal voices out there. We are, however, very confident writing text and writing scripts. So if we write the script for a video, we can generate an audio track from it and pair it with whatever video we're producing. And guess what? If a little bit of text changes, you just regenerate that portion of the audio track. You don't have to do the whole thing over again. And you don't have to worry about, oh, well, I'm not sounding great today, or I don't have the same recording setup, or I'm not in an audio studio, or whatever else. It lowers that barrier to entry for acceptable quality audio in professional settings.

Zohra:

Manny, you've covered so many examples here in such a great way. You illustrated accessibility, inclusivity, the applications of audio synthesis. Those were all of my follow-up questions, and you just answered everything in a nutshell right there. But for my benefit personally, we are going to tackle those in a little more detail. Accessibility is such a great thing. No, this is fun, this is great. You keep me on my toes. But great applications, great examples of where this technology is a savior, the upsides of this technology, absolutely. Manny, you wanted to say something?

Manny:

Yes, it is. And accessibility is near and dear to my heart, as I know it is to many people in our communities. There are so many people who benefit from any sort of assistive technology. It's the curb cut effect. For those who aren't familiar with that: if you look at a sidewalk next to a road, there is a curb that drops the sidewalk down to the level of the road. For folks who are disabled, perhaps they're in a wheelchair or using a cane or a walker, those curbs can be difficult to manage when they're trying to cross the road, and in some cases can make crossing actively dangerous. A curb cut is where the curb dips down and there is a gradual slope that leads from the sidewalk down to the level of the road so that someone can go across a crosswalk. Now, these were initially designed with accessibility in mind, but they benefit everyone. A traveler who's hauling around a suitcase on wheels, cool, they get to benefit from the curb cut. Someone who is temporarily overburdened, carrying a whole raft of groceries in both arms, might not have a great view of exactly where they're stepping. Well, if they use the crosswalk and the curb cut, they don't have to worry about accidentally missing a step and spilling all their groceries when they trip. Kids can more clearly see where they're supposed to cross the street, so it's much safer. And then, of course, all of the people this was originally designed for benefit, because wheelchairs can much more easily go down the curb cut and back up on the other side. This phenomenon of accessibility improvements enhancing everyone's lives is known as the curb cut effect. And here, with audio synthesis, with speech synthesis, I know I benefit from it all of the time. I benefit from it every day. So do my kids, so do many other people around me, whether you're using Siri or Alexa or the Google Assistant or whatever else. There's so much to this. Even kids who are young enough that they can't read yet can still interact with these technologies, although I will caveat that it should be with parental provisioning and supervision and all that. But all the same, they can have these sorts of interactions that they wouldn't have been able to previously. By making this an accessible, assistive technology, we improve everyone's lives.

Zohra:

Absolutely. An interesting example comes to my mind. We recently traveled to India, and my older kid volunteered at a government school where English is not the main medium of instruction. The kids can, of course, speak and write in English, but for them to be able to use these assistive technologies and listen to different kinds of accents was remarkable. One of the examples shared with me was how amazing they found it to hear somebody speaking in a British accent versus another accent, and this was being created on the fly. They could also test the technology, record themselves, and hear how they would sound. I found that it can become a fun educational medium, like you said, where, with all the parental controls in place, you can use these assistive technologies to your own advantage. And thanks for explaining what curb cuts are, because in India the sidewalks, or footpaths, do not have curb cuts. So when I came to the US in 2000, and I'm dating myself, I found this kind of assistive design magical; it amazed me that somebody would even give it a thought. Because unfortunately, in many countries where I've traveled, disabilities, and I'm not only talking about disabilities that are obvious to us, disabilities can come in many forms, or you can simply be inhibited in a moment, like you said, when you're dealing with kids or carrying many things, and in that moment you are "disabled," quote unquote. Thinking through these different scenarios and how we make our lives easier is something that is often not given consideration. So something that was designed for one purpose can benefit so many different people, even people who, and I don't say this loosely, may not have a disability in the more conventional sense. That is being thoughtful. You're being thoughtful and empathetic in your design. Empathy comes into that.

Manny:

Yeah, 100%. And frankly, that's part of why I think so many of us in technical communication care as much about assistive technologies as we do, because empathy is at the heart of what we do. Accessibility is just another extension of that. There's the communication, there's the accessibility; I see them as very closely related.

Zohra:

Yep. All right. We are now going to talk about how we integrate audio synthesis into our technical communication.

Manny:

Sure. So let's start with the example that I gave earlier. You have a page, a document, perhaps a guide that you want to be more accessible or that you want to have available in a different modality. Cool. You can take that content and pass it to a speech synthesis tool of choice, and it can generate audio for that text. Now, there are a wide variety of options. The leading proprietary speech generation platform is ElevenLabs, and I have experimented with their platform. I find it genuinely impressive. Be ready to pay to use it; it is proprietary and the free tier is pretty limited, but you can choose from a variety of voices and off to the races you go. What that would look like is: anytime you update the text in the guide, you regenerate the audio for it. This could happen as part of a CI/CD pipeline if you're using docs-as-code, or it can be part of your publication process if you're using more of a CMS-based workflow. Either way, you make sure that you supply the text in a format that your generation tool can accept, and it gives you the audio. You take the audio, embed it in whatever little player widget you need on your page, and then you're good to publish. Now, if you are doing this and you have documentation that's on the more technical side, whether that's software or hardware, there are a few things to keep in mind. You don't want to just take the raw text and pass it along if it includes things like diagrams or schematics or code blocks, because just like you wouldn't expect to have a code block read to you in an audiobook, you don't want the speech synthesis tool to see a bit of text and then output what's effectively audio gibberish. You need to take your content, make sure that it represents the guide as you want it to be read, and then pass it along. That might mean, for images, making sure you have alt text, back to accessibility. For code blocks, you have some sort of descriptive text that goes along with them, so that if the code block is unavailable, like in an audio medium, you can substitute that text for the code block and have it described audibly. You have to get the text ready to become audio. It's a different modality, and there are different considerations. Just like you wouldn't take a book and immediately use it as the script for a TV show, you have to adapt it. You have to do the same thing here. But if you get that process in place, if you have an alternative description for each of the elements that don't work well in audio, then you take your content, create what is effectively the script of the document, have that generated, and embed it into your page. That's process number one. Process number two is the audio track generation for videos. You have a video that you want to create, and for whatever reason you don't want to use your own voice. Maybe it's even a business guideline, because they don't want their videos to be associated with any one person; they want them associated with the business. It also means they can have anyone on the team write scripts to be generated, and it all has the same literal voice. Well, in that case, as you go through your video editing process, you create a script. You're going to have to do that anyway. And so instead of you doing the performance, you can have the audio generated. You can either do it all in one go and edit it later, or you can do it in snippets and then place each snippet at different points in the audio track while you're editing. Either way, if there are things that you need to change, it becomes easier to keep that script as a text file alongside all of the other source artifacts for the video, so that if something has to change, you don't have to re-perform it. You can just take the bit of text that needs to change, regenerate that bit of audio, and cut it in right where you need it in the video. And this can actually pair well with video generation tools. If there are portions of the video that are more conceptual or that you want to be a bit more fanciful, depending on what you are describing, then you can generate video and pair it with generated audio. I'm not necessarily recommending this, it's very easy to have that go bad, but I'm just putting out a hypothetical. All of these tools need to be used with careful consideration. But audio generation, speech generation, really lowers that barrier to entry when it comes to creating video.
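As a rough illustration of the content-prep step described above, here is a sketch in Python that turns a Markdown page into a plain-text "script" before handing it to a synthesis tool: code blocks are swapped for a spoken stand-in and images are reduced to their alt text. The file paths and the synthesize_speech placeholder are hypothetical; wire in whatever TTS tool and description convention you actually use.

```python
# Sketch: prepare a Markdown page for speech synthesis.
# Paths and synthesize_speech() are hypothetical placeholders.
import re
from pathlib import Path

FENCE = "`" * 3  # a Markdown code fence marker

def markdown_to_script(md: str) -> str:
    """Turn a Markdown page into a plain-text script suitable for TTS."""
    lines = []
    in_code_block = False
    for line in md.splitlines():
        if line.strip().startswith(FENCE):
            if not in_code_block:
                # Entering a code block: speak a stand-in description instead.
                lines.append("A code sample appears here; see the written docs for details.")
            in_code_block = not in_code_block
            continue
        if in_code_block:
            continue  # never read raw code aloud
        # Replace images with their alt text: ![alt](url) -> alt
        line = re.sub(r"!\[([^\]]*)\]\([^)]*\)", r"\1", line)
        # Strip heading markers, emphasis, and inline-code backticks.
        line = re.sub(r"[#*_`>]+", "", line)
        lines.append(line)
    return "\n".join(lines)

def synthesize_speech(script: str, out_path: Path) -> None:
    # Placeholder: call your TTS tool of choice (ElevenLabs, Kokoro, XTTS, ...)
    # and write the resulting audio to out_path.
    raise NotImplementedError

if __name__ == "__main__":
    page = Path("docs/getting-started.md")
    script = markdown_to_script(page.read_text(encoding="utf-8"))
    synthesize_speech(script, page.with_suffix(".mp3"))
```

In a docs-as-code setup, a step like this would run before generation so the audio never contains raw code or image URLs.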

Zohra:

It's such an interesting thought. I thought that the barrier to entry would be higher, but you actually illustrated with great examples why it could actually lower that barrier. You used the acronyms CI/CD pipeline and CMS. Please expand on what those mean in our context here.

Manny:

So CI/CD is continuous integration, continuous delivery. It's an engineering acronym that effectively means that when you commit your documentation changes, in our case, to your docs repository, often stored in Git with a docs-as-code workflow, your CI pipeline automatically triggers a workflow that says: cool, anytime there's a commit, I'm going to do all of these tasks. One of those tasks could be to check which pages have changed, and then, if those pages have changed, generate audio for them and replace the associated audio file for each page in a given location. That way you don't have to do it by hand. You have to set it up, but then the machine takes care of all of that automatically anytime you commit text to your repository. And continuous delivery, the other half of that, is effectively automated publishing as far as we're concerned. It's a bit more nuanced than that when it comes to engineering, but for us it's: cool, our docs, all of our deliverables, automatically get published. As for CMS, that's a content management system. That's going to be something more like Heretto, which is the one that comes to mind right now. These are the systems that allow you to manage all of your documentation, oftentimes in a web-based interface. That way you don't have to deal with the underlying text markup itself; you get nice WYSIWYG, what you see is what you get, editors, kind of like Google Docs or Microsoft Word. These CMS tools often also handle the publishing, so they're much more of a one-tool-for-your-entire-docs-workflow sort of solution. In those situations, depending on what support your CMS has for automated workflows, you would either hook into that or figure out some other point in your publication process to regenerate your audio.
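To make the docs-as-code half of that concrete, here is a sketch of the kind of script a CI job could run on each commit: it asks Git which pages changed and regenerates audio only for those. The branch name, docs/ folder layout, and generate_audio_for placeholder are assumptions for illustration, not a prescribed setup.

```python
# Sketch of a script a CI job might run on each commit:
# find changed Markdown pages and regenerate their audio.
# Branch name, paths, and generate_audio_for() are illustrative assumptions.
import subprocess
from pathlib import Path

def changed_markdown_files(base_ref: str = "origin/main") -> list[Path]:
    # Ask git which files under docs/ differ from the base branch.
    out = subprocess.run(
        ["git", "diff", "--name-only", base_ref, "--", "docs/"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [Path(p) for p in out.splitlines() if p.endswith(".md")]

def generate_audio_for(page: Path) -> None:
    # Placeholder: run the content-prep step and your TTS tool here,
    # then write the result next to the page, e.g. docs/audio/<page>.mp3.
    raise NotImplementedError

if __name__ == "__main__":
    for page in changed_markdown_files():
        print(f"Regenerating audio for {page}")
        generate_audio_for(page)
```

The CI system itself (GitHub Actions, GitLab CI, and so on) would simply invoke a script like this as one step in the pipeline and commit or upload the regenerated audio files.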

Zohra:

Great. Thank you so much. Some other examples of CMSs that I can share are Paligo, yes, and even MadCap Flare for that matter, which is a lightweight CMS. I'm thinking of Zendesk, maybe. I haven't used these, but I remember I interviewed a gentleman from Paligo a long time ago for one of my episodes. So those are probably some others, right?

Manny:

Yes. And a few more come to mind, like KnowledgeOwl, and Document360 is technically a CMS. There are a lot of these options out there. It's an entire category of tools.

Zohra:

Yes, there is, there is. Great examples there. I think we've laid down a great foundation for how to think about audio synthesis: what the examples are, what considerations to keep in mind, and what its applicability is, with all the fantastic examples that you've given. I had such a fun time. You simplified it for me, because when I think about these things, I somehow find them unattainable. In my role so far, I've not used these technologies, and therefore it seems like the bar is high, the barrier to entry is high. But you just laid it out and simplified it so that I can almost imagine, oh, I can get started with this.

Manny:

Yeah, and getting started with it is really simple nowadays. Like I mentioned, for the proprietary option, ElevenLabs is where I suggest you start. It is super easy to get started, and they have an excellent web-based interface. For open source, if you're more technical and want to go down that path, the most popular options right now are Kokoro, K-O-K-O-R-O, and XTTS by Coqui, that's the group. Those are both open source, and there's a lot that you can do with them. They also have a much higher barrier to entry. So pick which path you want to get started with. But overall, audio synthesis is a lot easier than it used to be. There is one thing that I want to mention before we end this conversation, and that is another hot topic and one with ethical considerations: voice cloning. It's a thing, and it has legitimate business uses. If you have someone who agrees to it, and a business says, yes, we are going to use this voice model that we've created from this consenting actor, then 100%, that's totally fine. Everyone agreed to it. And it can lead to very high quality, custom, tailored results. But you have to be really, really, really careful, because deepfakes are a thing, and voice cloning can effectively be an audio deepfake if you take snippets of audio that are publicly available online, or were collected without consent, and train a model on them. That's not okay. It is technically possible, just like it is technically possible, with or without generative AI, to create a new work and pass it off as someone else's, or to falsify documents, or whatever else. Doing it without consent falls into that same category of misuse. So if you want to experiment with it with your own voice, that's awesome. Go for it. I personally generated myself reading the Declaration of Independence when I experimented with cloning my own voice. But like any other tool, if you use it without empathy and consideration for other people, it can go badly very quickly. The fact that the tool can be misused does not mean that it doesn't have practical, ethical applications.
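For the open-source route, here is a rough sketch of what using Coqui's XTTS through their Python TTS package can look like. The model name, arguments, and the need for a short consented reference clip are based on my reading of the project's documentation and may differ between versions, so treat them as assumptions.

```python
# Rough sketch of voice synthesis with Coqui's open-source TTS package (XTTS).
# Model name and arguments are assumptions and may vary by version.
from TTS.api import TTS

# Downloads the model on first run, then loads it.
tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2")

tts.tts_to_file(
    text="This guide walks you through installing the CLI.",
    speaker_wav="samples/my_consenting_voice.wav",  # a reference clip you have rights to use
    language="en",
    file_path="getting-started.wav",
)
```

A script like this can also be dropped into the CI step sketched earlier, so regenerating audio becomes just another build task.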

Zohra:

You are reading my mind and answering even before I have the opportunity to ask the question. So you really know how to lead this conversation, which is fantastic, because that was my follow-up question: what are the ethical or accessibility concerns that we can introduce with audio synthesis in our multimodal technologies? And you gave a perfect example. This may be a random question, and maybe you know the answer, and if you don't, that's fine. But are there any libraries of voices, I'm even trying to phrase this question, voices that are approved? For example, if an actor wanted to lend their voice and they've already approved it, is there a library out there that allows you to buy that voice, so to speak, pay a royalty, and then use it in your audio synthesis?

Manny:

So that is part of what ElevenLabs does. They can do voice cloning, and if you pay for one of the higher subscription tiers, they have a whole host of models that are trained on different voices. So it's not just the generic British English or whatever other models they have available; some of these models are trained on well-known people in the voice acting community. That is one way you can use models in an acceptable way for high-quality voices. Another way to go at it, especially if you're going the open source route, is that there are publicly available datasets of audio. If you go onto Hugging Face, which is like the GitHub for AI and machine learning, there is an entire datasets tab. You know what, I'll do it right now. As of this recording, there are 21,000 audio datasets. Not all of them are necessarily voice data, but the top one that I'm looking at right now is Persian Voice V1. These datasets list their licenses to say what their approved usage is. So just cruise on through if there's something that you want to use. But in that case, you're likely going to have to take the dataset, and if you're using it with an open source toolkit, you're going to have to create your own custom voice with it, and that's its own discussion. But if you want to quickly ramp up on this, again, ElevenLabs has made it super easy.
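If you do go the dataset route, the Hugging Face datasets library is the usual way to pull one down and inspect it before committing to it. The dataset ID below is a placeholder and the field names vary by dataset; always check the dataset card and license first.

```python
# Sketch: download and inspect an audio dataset from Hugging Face.
# "some-org/some-audio-dataset" is a placeholder ID; check the dataset
# card and license before using it for any voice work.
from datasets import load_dataset

ds = load_dataset("some-org/some-audio-dataset", split="train")

# Audio datasets typically pair a clip with its transcript, but the
# exact column names depend on the dataset.
example = ds[0]
print(example.keys())
```

From there, training a custom voice on top of an open-source toolkit is, as Manny says, its own discussion.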

Zohra:

Fantastic. I had no idea, again. I get to learn such amazing stuff from you, Manny, every single time we talk. And I would love to do a deep dive sometime. I know I always ask for your time, and I feel guilty; you give us all this advice for free, and I feel like I'm just taking advantage. I don't mean to, but you're teaching the community. What can I say? We've covered such amazing ground. I think I might wrap up on this, because we are coming up on time here. Looking ahead, what do you see as the biggest hurdles, technical, creative, or social, to fully realizing the potential of multimodal AI agents?

Manny:

Oh, that depends on how far forward you look. In the more immediate term, they're already accepted by and large. People use ChatGPT, people use Claude, people use Gemini; they pass in their videos and images and they talk to them and all of that. Those are the tools that we are using right now. But looking forward, there is a particular application of these tools that Scott Abel actually talked about a few months ago, and that is digital humans. Imagine a help desk system where you are on a website and, instead of just dealing with a chat bot, you can click to have a conversation with whatever virtual agent is there. Taking it a step further, you might also click to have a video conversation. Based on a model, or based on a person, an actor, it would have a generated video representation of somebody looking at a camera and talking to you, and it would have generated audio of whatever response the back-end system provides based on your turn of the conversation. That has applications in customer support contexts. It also has applications in video game contexts, being able to dynamically respond to players in that space. But getting to the point where that is acceptable is going to be really, really, really hard, because the quality of responses needs to be high in the first place, and then you have to deal with the uncanny valley effect, where something appears to be human but is off in just the wrong ways, so it triggers an innate human response to not trust it. Overcoming that, both for audio and for video, and particularly when they're paired together, because you also have to get the lip syncing right, I think we are a very long way from having that be generally accepted.

Zohra:

Interesting, interesting. Wow, this has been a fantastic conversation, Manny. Thank you again for bringing all these real-world examples, simplifying the concept itself, making it more accessible, and showing how you can leverage this technology. Fantastic, fantastic. Any last-minute thoughts or considerations or suggestions or anything at all that comes to your mind?

Manny:

I suppose the same plea that I have given many times before: this technology, in its current incarnation, is new. Don't let that stop you from experimenting. In fact, use that as an excuse to experiment, even if nothing comes of it immediately. The more familiar you are with something, the less scary it is, and the better you understand its limitations, when it can be used and when it shouldn't be used. So, in whatever way you're comfortable, or even if you're not comfortable and can push yourself out of your comfort zone enough, experiment. Just see what you can do with it. That way you have a better understanding and can help educate others as well.

Zohra:

Thank you, Manny. I was looking at your background and I found it so symbolic. There's a little yellow Lego box sitting there, and you think about building blocks, you think about experimentation. To me, it represents exactly that, and it's so apt for our conversation: do a deep dive yourself, build on your own, and understand before you take larger leaps with it.

Manny:

You don't have to become an expert, you just have to become educated.

Zohra:

Educated. The literacy part of it is so important here. Exactly. Thanks, Manny. I really, really appreciate you. This is my bow to you. Thank you for your time on a Friday. I know we had to cancel our previous recording for this, and you took the time and got back on my calendar. You're always available. I really appreciate you. And please continue to spread your knowledge. We need it. We need you. You are the next frontier, how about that? For me. Thank you so much.

Manny:

I didn't say it, folks. Thank you for having me, Zohra. As always, it was a pleasure.

Zohra:

Always a pleasure. Thanks, Manny. Listen to Inside Tech Comm on your favorite app and follow me on LinkedIn, or visit me at www.insidetechcomm.show. Catch you soon on another episode. Thank you for listening. Bye-bye.