Inside Tech Comm with Zohra Mutabanna

S5E5 Mastering AI-enhanced Content Creation using Retrieval-Augmented Generation (RAG) with Manny Silva

May 30, 2024 | Zohra Mutabanna | Season 5, Episode 5

Can outdated training data and hallucinations in large language models (LLMs) hold back your AI projects?

Manny Silva joins us to uncover Retrieval Augmented Generation (RAG), a technique that improves the output of LLMs. He shares practical insights with real-world applications that ensure AI-generated responses are accurate and contextually relevant.

The future of content creation is going to be automated. Manny reveals how we can cut down the time required to produce first drafts by using developer tools and open-source communities. Learn about integrating engineering and design documents, and templates from the Good Docs Project, all optimized for LLM inference.

This episode also explores the delicate balance between AI and human creativity in writing, emphasizing that while AI can handle initial drafts, human input is invaluable for refinement. Tune in for a compelling discussion that will transform how you approach AI and content creation.

Link to the video demo for RAG. The demo runs from 43:10 to 49:30.

Guest Bio

A tech writer by day and engineer by night, Manny Silva is Head of Docs at Skyflow, codifier of Docs as Tests, and the creator of Doc Detective. He’s passionate about intuitive and scalable developer experiences and likes diving into the deep end as the 0th developer.

Resources

Show Credits

  • Intro and outro music - Az
  • Audio engineer - RJ Basilio

Transcript

Zohra:

Hello folks, welcome to Season 5 of Inside Tech Comm with Zohra Mutabanna. This season we are focusing on tools, tips, and strategies to elevate your craft. Let's dive right in. Hello listeners, welcome to another episode of Inside Tech Comm with Zohra Mutabanna. Today we have Manny Silva. If you don't know, we've already recorded with him, but just a brief intro: a tech writer by day and an engineer by night, Manny is the Head of Docs at Skyflow, codifier of Docs as Tests, and the creator of Doc Detective. In fact, he gave us a demo of that in episode two of season five, so if you haven't checked it out, definitely check it out. He's also passionate about intuitive and scalable developer experiences and loves to dive into the deep end. So with that brief intro, Manny, welcome to my show again. I'm so excited and looking forward to this altogether new conversation.

Manny:

I'm so excited to be back, Zohra. Thank you for having me again.

Zohra:

Thank you. I have to say that the demo that you gave us was very well received, so congrats on that, and thank you for giving me another opportunity to pick your brains.

Manny:

My pleasure, my pleasure too.

Zohra:

Equally, my pleasure. Now we are going to be talking about retrieval-augmented generation, or RAG. Is that right? Is it an acronym or an abbreviation?

Manny:

An acronym, yes. And yeah, that's totally fine. RAG is how you're going to see it written about generally, so we may as well use it here.

Zohra:

Awesome. From what I understood, based on my little research, I'm just going to read out the definition, quote unquote, and then we're going to dive into what it is for the benefit of our listeners. Retrieval-augmented generation is a strategy that helps address large language model hallucinations and out-of-date training data. RAG augments the knowledge of these large language models by retrieving relevant information for user queries and using that information in the LLM-generated response. Now, this is very technical, and this is what we are going to pick apart and do a deep dive into for our listeners. So no stress there. We are going to discuss as much as we can and pick this apart. So, Manny, over to you. Please tell us a little bit about yourself for those who haven't heard your episode before, and then we can quickly dive into RAG.

Manny:

Sure. So I like to tinker, and while I am Head of Docs at Skyflow, I like to get my hands dirty, and one of the projects that I got my hands dirty with is building Skyflow's internal AI toolkit, and a lot of that revolved around a retrieval-augmented generation strategy. Now let me take a step back. RAG came about because LLMs have a knowledge problem, and the knowledge problem really comes down to two main points, and you already touched on them a little bit, but let's reiterate. First, training data sets for LLM models are out of date. Some are more out of date than others; it depends on the model that you're dealing with. And two, LLMs make stuff up when they don't have facts, aka hallucination. They try to fill in the gaps with what they think is most probable, which is why they sound very confident when they're making stuff up. Now there's one primary strategy for remediating this, and that is, in essence, retrieval-augmented generation, RAG. Now, a really fun way to see RAG in action is to go to ChatGPT, a 3.5 model, and ask "What is RAG?", and it might say, and I have done this before, so I know the exact answers that it can give, it might say, oh, it can refer to different things depending on the context. Is it red, green, amber? Is it RAG analysis? Is it a random access generator? Is it a resource allocation graph? But if you are using RAG, you might get the response that RAG, or retrieval-augmented generation, is a method introduced by Meta AI researchers, and then, you know, it goes on to what you already said, Zohra. But the short version of it is RAG is a way of supplementing the LLM's default knowledge with known good context. If you see tutorials about, oh, chat over your documents or that sort of thing, that's RAG in a nutshell, because it's taking information that the LLM doesn't already have and including bits of it in the prompt, so that the LLM has known good context to inference with. And so what that might look like is, oh hey, Manny, I want to ask an LLM, what is Skyflow? Well, an LLM might not know that, because maybe it didn't make it into the training data, and the training data for this model is only updated as of, I don't know, January 2022, which is a real thing that happens.

Manny:

Or, you know, oh hey, Manny, I want to have the LLM be knowledgeable about a particular procedure for XYZ product, but oh hey, it's not in the training data. How do I do that? Well, you get whatever content that procedure is defined in, and you can use a RAG strategy to have your LLM orchestration system, whatever you're using, find the necessary data from its source and include that with your prompt, so that the LLM has the known good context to respond with. I know that sounds really complicated. We're going to break it all down into various components, but it lets you chat over your documents. It lets you make API calls and fetch data that the LLM doesn't have access to, so that the LLM can know it and respond to your question appropriately.

Zohra:

That's awesome, and I did try it out as well. The 3.5 version of ChatGPT did give me something random, and I didn't bother to read it because you'd given me a heads up, but you're right about that, and GPT-4 told me exactly what RAG was, so you can definitely see that in action. Which is good for people like me who, I'll be honest, had not encountered RAG until you and I talked about it and how it can help you. I suppose in large companies they are probably already including RAG in their architecture, but this may again be relevant to smaller organizations where there is a smaller tech writing team and they are venturing out into the artificial intelligence territory. So that is the audience we are speaking to. Now, you mentioned, yes, it was very technical, so I want to break down that architecture, and you mentioned orchestration tools and some things that I probably already forgot, so let's do a little bit of picking apart of that architecture, or the framework. Is it an architecture? Is it a framework? What is it?

Manny:

First of all, I think of it as an architecture, and there are multiple parts of a RAG architecture. So what it comes down to is you've got your orchestration layer that ties all of the necessary tools together. If you're building it yourself, that might be something like LangChain or Semantic Kernel or native Python or JavaScript code, whatever you're writing. So, orchestration layer, piece number one. Piece number two is the LLM itself, and that's whether you're using the GPT models from OpenAI, Anthropic's Claude models, some locally run model on device, whatever. An LLM is piece number two, and piece number three is your retrieval tool. So that might be a knowledge base or a vector database or APIs. That's where you fetch the additional information from. And what ends up happening is, well, let me use OpenAI as an example.

Manny:

ChatGPT is actually the orchestration layer. ChatGPT lets you input a user prompt, it lets you upload documents if you're on a paid plan, and it creates its own little knowledge base, a vector database, internally with whatever you upload, and the LLM on its backend is the GPT model. And so what ends up happening is the user goes in and inputs a prompt, hey, what is retrieval-augmented generation, and say you're on GPT-3.5 and it doesn't know what it is. If you've uploaded a document that defines what retrieval-augmented generation is, it will go and query that document and say, oh hey, is anything similar to the user's input existent in the uploaded documents? It will retrieve the closest matches and pass those along to the LLM along with your query. That way the LLM, even if it doesn't have the answer in its training data, has known good context to infer against.
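
To make the three pieces concrete, here is a minimal sketch in Python of what an orchestration layer does: embed the user's query, find the closest stored chunks, and build an augmented prompt for the LLM. The `embed`, `retrieve`, and `build_prompt` helpers here are hypothetical stand-ins, and the bag-of-words "embedding" is a toy; a real system would use an embedding model and a proper vector database.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a bag-of-words "vector".
    # A real RAG system would call an embedding model here.
    return Counter(text.lower().split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Piece 3: the retrieval tool -- chunks stored alongside their vectors.
knowledge_base = [
    "RAG, or retrieval-augmented generation, supplements an LLM's default knowledge with known good context.",
    "Skyflow is a data privacy platform.",
]
vector_store = [(chunk, embed(chunk)) for chunk in knowledge_base]

def retrieve(query: str, top_k: int = 2) -> list[str]:
    query_vec = embed(query)
    ranked = sorted(vector_store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

def build_prompt(query: str) -> str:
    # Piece 1: the orchestration layer stitches the retrieved context into the prompt.
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Piece 2 would be the LLM itself: this augmented prompt is what gets sent to it.
print(build_prompt("What is RAG?"))
```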

Zohra:

Excellent. That is beautifully explained and I was able to really understand how, at a very high level, the architecture works. I think we are good to go with this.

Manny:

Cool. There is a piece that I'd like to dive into a bit, because this is what most pertains to content authors, and that is, I want to demystify what a vector database is.

Zohra:

Yes.

Manny:

And I want to just share a little bit about how our content goes from being our content to being ingested in a RAG architecture. So, in a nutshell, a vector database is a database of numbers associated with text. Vectors are a way of representing chunks of text by their relationship to each other. It's very highly mathematical, and I don't pretend to understand the nitty gritty of all of it, but suffice it to say you can pass a string of text into an embedding model and it generates numbers based on its understanding of the text, and those numbers are stored, along with the literal text, in a vector database.

Manny:

Okay, those numbers are interchangeably called vectors or embeddings.
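
As a concrete example of that step, here is roughly what generating an embedding looks like with OpenAI's embeddings API; this is a sketch, any embedding provider works similarly, the model name is just one current option, and it assumes an API key is set in the environment.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

chunk = "RAG supplements an LLM's default knowledge with known good context."
response = client.embeddings.create(model="text-embedding-3-small", input=chunk)

vector = response.data[0].embedding  # a long list of floats
# Store the vector and the literal text together; that pair is one row in a vector database.
record = {"text": chunk, "embedding": vector}
print(len(vector))  # e.g. 1536 dimensions for this particular model
```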

Zohra:

I see.

Manny:

And so what ends up happening is let's go back in the process a good bit, now that we know what a vector database is.

Manny:

Let's say you have some Markdown. What ends up happening is you have to do a number of steps to get your Markdown into a format that is reasonable to be stored in a vector database. Markdown files, and files of any type really, can be very large depending on how they're architected, but the data stored in vector databases tends to be very, very small. So what you have to do is you take your source content, wherever it is and whatever format it's in. You have to clean it up, make sure you get rid of extraneous markup, make sure you cut it down to only the necessary text that you want to eventually be passed to the LLM. Once it's cleaned, then you have to split it, or chunk it. Those terms are used interchangeably as well, and there are a lot of tools that help you do this, like LlamaIndex, which is a very popular one (not associated with Meta's LLM, aside from the name), but LangChain can also help you do this. What it does is it takes the text of your content and it splits it into sections, into, well, chunks, so that they are small and digestible. You need them to be small so that they can fit into the LLM's inference window further down the line, and we will talk about that a little bit later. But it splits your document into little chunks, and then it takes those chunks and embeds them. It generates the vectors and then stores them in a vector database. So in the end you have fragments of your original document stored both as literal text and numerically in the vector database. But what ends up happening is, when an orchestration layer goes to fetch information from the vector database, like if I say "What is RAG?", if that's my input, the orchestration layer will take "What is RAG?" and see if there's anything that matches or has a close match to that in the vector database, by using the vectors and doing what's called a similarity search, seeing if there are any chunks of text in that vector database that are similar enough to the input. And if they're similar enough, then it takes those chunks and can send them off to the LLM.
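
A minimal sketch of that clean-and-split step, assuming Markdown that gets split on headings and capped at a rough chunk size; real splitters in LangChain or LlamaIndex are smarter than this, and the size cap and sample document here are hypothetical.

```python
import re

MAX_CHARS = 800  # rough cap so each chunk fits comfortably in the model's inference window

def clean(markdown: str) -> str:
    # Remove extraneous markup (here, just HTML comments) while keeping prose and code blocks.
    return re.sub(r"<!--.*?-->", "", markdown, flags=re.DOTALL)

def split_into_chunks(markdown: str) -> list[str]:
    # Split on headings first, then break oversized sections apart by paragraph.
    sections = re.split(r"\n(?=#{1,6} )", clean(markdown))
    chunks = []
    for section in sections:
        if len(section) <= MAX_CHARS:
            chunks.append(section.strip())
            continue
        buffer = ""
        for paragraph in section.split("\n\n"):
            if len(buffer) + len(paragraph) > MAX_CHARS and buffer:
                chunks.append(buffer.strip())
                buffer = ""
            buffer += paragraph + "\n\n"
        if buffer.strip():
            chunks.append(buffer.strip())
    return [c for c in chunks if c]

doc = """# Service accounts

A service account authenticates machine-to-machine calls.

## Create a service account

Use the management API to create one.
"""
for chunk in split_into_chunks(doc):
    print("---\n" + chunk)
# Each chunk would then be embedded and stored in the vector database alongside its literal text.
```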

Manny:

Now why am I explaining all of this? Because it's important we understand how our content is being used so that we can optimize it. Because, using my example, "What is RAG?", if there are no similar matches, then you're not going to get anything back, the LLM isn't going to have any known good context to inference against, and you're not going to get a good response. So there is a lot that we can do as content authors to at least interrogate the process of chunking our content, to own part of the process of chunking our content, to make sure that we optimize it as much as possible.

Manny:

And there are a lot of things that we can do to optimize our content. First off, we need to make sure, starting from all the way down at the beginning of this process, that our source content gets cleaned properly, that whatever the cleaning process looks like, it doesn't remove too much content. If, for example, you have code blocks in your content, it shouldn't remove those, because those are important. But also you need to make sure that your content has adequate context. So when you're splitting the content, you need to make sure that, oh hey, this code block isn't all out on its lonesome, because usually code blocks will have an introductory sentence or paragraph that adds a lot of context and meaning to what the code block actually does. Does the code block have comments that help explain what the code does?

Manny:

But also, even if you don't have code blocks, okay, you've got a random chunk of text that is largely devoid of context. How do you give it context? How do you help the LLM understand where this text belonged before? You can do things like adding metadata. If you have, like, front matter in Markdown, you can add front matter to each and every chunk to say, oh hey, this chunk of text belongs to this file, this chunk of text existed underneath this heading, underneath this subheading. You can even go the extra mile and create a summary of the page that it came from and embed that summary as front matter as well, so that, even if that particular chunk is all that was returned from that page, the LLM still understands the context that it appeared in. And by doing that, by providing keywords, by a lot of these SEO-type things that many of us already do, all of that information can help the LLM better understand what you're giving it and help the LLM therefore give you better responses and not make stuff up.
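
Here is one way that kind of context could be attached, a sketch that prepends front matter (source file, heading, page summary) to each chunk before it gets embedded; the field names, file path, and sample chunk are all hypothetical.

```python
def with_front_matter(chunk: str, source: str, heading: str, page_summary: str) -> str:
    # Prepend front matter so the chunk carries its context even when it is retrieved on its own.
    return (
        "---\n"
        f"source: {source}\n"
        f"heading: {heading}\n"
        f"page_summary: {page_summary}\n"
        "---\n"
        f"{chunk}"
    )

chunk = "curl -X POST https://example.com/v1/service-accounts"  # a code block out on its lonesome
enriched = with_front_matter(
    chunk,
    source="docs/service-accounts.md",
    heading="Create a service account",
    page_summary="How to create and manage service accounts.",
)
print(enriched)  # this enriched text is what gets embedded, not the bare chunk
```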

Zohra:

I want to, I guess, simplify this a little more, or rather break it down a step further.

Manny:

Sure.

Zohra:

As you were talking about chunking the data, right, how you clean it up. Now, one of the efforts that I believe is happening at companies is we are starting to look closely at the content that we have, to make sure that it is usable or we can align it with our future AI strategy, however that content is going to be presented, whether through a chatbot or whatever form it takes. Right? So we are starting to look at huge amounts of data. Given that challenge, what do we do where there is no metadata?

Manny:

So the secret is that there's always metadata. You just need to know where to look, and you're already working with LLM systems and they can help. So part number one is identify who's putting together your RAG system and who's the one who's going to be consuming your content. Partner with them, partner with this engineer or data scientist and say, hey, I know how to optimize my content in your pipeline. Let me help you. As far as where you find the metadata, you've got the file name, you know the URL where it's accessible from, assuming it's externally available. You know the headings and the subheadings and the figures that the content relates to. And so, as you're working with this engineer or data scientist or whoever, say, hey, look, I need you to make sure that you include the file name, the accessible URL, all of the context that the content appears in.

Manny:

If you think about it this way, if you have, like, a webpage that's an article, there's the body of the content, and that's great. It's a lot of content. But there's a lot outside of just that raw content that is communicated by the website. There's the hierarchy of where it appears, all of the ancillary information. What product does it belong to? That's often communicated by the upper navigation or the side navigation. What are the related pages? All of that is relevant metadata, and so, by identifying all of these things that are communicated to us visually or, you know, via whatever accessibility tools we use, we need to find ways of seeding that same information to the LLM for each chunk of data we present to it.

Manny:

And the second part is we're working with LLMs. If there's an LLM that you've been approved to use your content with, then you can say, hello, LLM, provide a summary of this for me, please, in a sentence, and include that as just a summary description. Or you can say, oh hey, summarize this whole page, and here's the page summary. And then summarize this chunk, and here's a chunk summary. And all of that helps the LLM identify what it's working with, and helps the RAG system identify the chunk in the first place, because it can match against the metadata. And you can even do things like, oh hey, what are five keywords for this chunk? There you go, more metadata. Use LLMs to your advantage. The content is there. We need to use the systems to help us optimize it for further consumption into those systems.
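
A sketch of what that could look like in practice, using an LLM to generate the summary and keyword metadata for each chunk; `ask_llm` is a placeholder for whatever approved chat-completion call you have available, and the prompts and field names are illustrative.

```python
def ask_llm(prompt: str) -> str:
    # Placeholder: in a real pipeline this would call your approved LLM.
    return "(LLM response goes here)"

def enrich_chunk(chunk: str, page_text: str) -> dict:
    # Generate metadata with the LLM, then store it alongside the chunk for retrieval matching.
    return {
        "text": chunk,
        "page_summary": ask_llm(f"Summarize this page in one sentence:\n\n{page_text}"),
        "chunk_summary": ask_llm(f"Summarize this passage in one sentence:\n\n{chunk}"),
        "keywords": ask_llm(f"List five keywords for this passage, comma-separated:\n\n{chunk}"),
    }

print(enrich_chunk("A service account authenticates machine-to-machine calls.",
                   "Full page text about service accounts..."))
```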

Zohra:

Beautiful, excellent. I feel like, okay, I can do this now and I feel good about whatever data we have and use that as a great starting point to build something better out of it.

Zohra:

I feel really confident about this, actually. You know, because you're right, there is a lot of inbuilt metadata, and we can use the LLMs to glean what information we can, because we already have the raw data in front of us, and we can probably look at metrics to see, okay, what are users searching on, and glean from that data too, and use that as your metadata.

Manny:

Oh, totally.

Zohra:

Right. So I'm thinking there are many ways to build that metadata. You've got my creative juices flowing, Manny. Awesome. I think we also touched upon why technical communicators need to consider this, because we are content creators and we need to be plugged in to this process, and by simplifying what is available to you, I think we can invite more of our community to try it out, to experiment, or at least become knowledgeable about it.

Manny:

Yeah, and I feel it's very, very important for us, as technical communicators, to own this. I mean, LLMs are trained on content. We're the ones who create that content, and if we don't find a seat at that table, then, well, they're not going to consider us. And so by showing that we have domain expertise, showing them, yes, you need us along, because there's a lot that we know about this that you're not considering, we can help you help yourselves, then they will see the value in us. I mean, that's part of how and why I did what I did at Skyflow with our internal AI toolkit. It's like, cool, I write the docs, I own the docs, this stuff runs on my docs, so I'm going to own it, thank you.

Zohra:

That's awesome. Now why don't you share a little bit more about how this came to be at Skyflow and just a little bit about your journey?

Manny:

Sure. So I mean, that's really it in a nutshell. I like being on the cutting edge of things. I see, ooh, shiny new tool, I want to see how I can make use of it. And so all of this LLM-ness was happening. ChatGPT had just come out, and there was more developer tooling starting to come out around this in the open source communities, and I wanted to see how I could leverage it.

Manny:

For me in particular, I wanted to see how I could optimize my content creation, and my biggest issue was I have a small writing team, as many of us do, and I wanted to figure out, cool, how can I reduce my time to first draft? What does that look like? And can I create a system that is aware of Skyflow and all of the intricacies of how Skyflow works? My initial POC was two weeks. I took two weeks to do nothing but this, and, using the tools available in the open source community, I figured out how RAG worked. I took all of our documentation, white papers, blog posts, and I created a vector database. So that whole pipeline that I explained earlier, about the cleaning and the splitting and the embedding, I did all of that with all of our external materials, and I had a vector database, and that was great. And then I ran it against a number of LLMs, and I refined how my retrieval from the vector database worked, and I was able to ask questions from this system and get reasonable answers that I couldn't get from ChatGPT, that I couldn't get from systems unknowledgeable of Skyflow processes, and I could chat over the entirety of my content, and that was great. But that didn't help solve the problem, my biggest problem, which was how do I accelerate my content creation without dealing with hallucinations day in and day out? And so what I ended up doing from there was, I needed templates. It's like, okay, I know that I want to output content; what kind of content do I want to output? And I was already leveraging some templates internally.

Manny:

But I turned to the greatest open source set of documentation templates that I am aware of, the Good Docs Project. They're great and, for those unfamiliar, the Good Docs Project has a curated set of templates per doc type. They have both the templates themselves and guides for how to use the templates, and I took that information and I took those templates and I optimized them for inference with LLMs. So instead of just, oh hey, heading, kind of content that might appear here, heading, so on and so forth, I added placeholder instructions for each section: how should the LLM interpret any additional information I give it and apply that information to this given section? And I provided multiple examples so it actually knew what content for that kind of section looked like. It's called multi-shot prompting, providing multiple examples to give the LLM an idea of what it should output. And I created these inference templates for how-to guides, conceptual overviews, tutorials, use cases, and a handful of other things. And so that was wonderful. I had these templates that were good for inference, and then I had the already available documentation in a vector database.
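
A rough illustration of what an inference template with multi-shot prompting could look like; the section names, instructions, and examples here are hypothetical, not the actual Skyflow or Good Docs Project templates.

```python
HOW_TO_TEMPLATE = """
# How to {task}

## Overview
[Instruction to the LLM: in 2-3 sentences, explain when and why a reader performs this task,
using only the supplied context.]

## Prerequisites
[Instruction to the LLM: list required roles, credentials, or setup mentioned in the context.]

## Steps
[Instruction to the LLM: write numbered steps; include code blocks verbatim from the context.]
"""

# Multi-shot prompting: show the model finished examples so it knows what good output looks like.
EXAMPLES = [
    "# How to rotate an API key\n\n## Overview\nRotate keys regularly to limit exposure...",
    "# How to invite a teammate\n\n## Overview\nInvite teammates so they can review drafts...",
]

def build_draft_prompt(task: str, context_chunks: list[str]) -> str:
    examples = "\n\n---\n\n".join(EXAMPLES)
    context = "\n\n".join(context_chunks)
    return (
        f"Here are examples of finished how-to guides:\n\n{examples}\n\n"
        "Fill in this template using only the context below.\n\n"
        f"Template:\n{HOW_TO_TEMPLATE.format(task=task)}\n\n"
        f"Context:\n{context}"
    )

print(build_draft_prompt("create a service account",
                         ["Service accounts authenticate machine-to-machine calls."]))
```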

Manny:

But then I ran into another issue, and that was, okay, I have existing content and I have templates for new content, but where is the source of the new content? Because if I'm developing a new feature, what does this look like? And so I just sat back and I was like, okay, we have PRDs, product requirements documents, and we have engineering design documents, we have API design documents, and those have all been vetted and reviewed, so those are known good context. So I took it a step further and I built into my system a way to upload arbitrary files and create a second vector database out of the uploads. And then what I ended up doing was knitting this all together, and so I could essentially say, hey, I want to upload this file for, I don't know, service accounts and how they work in Skyflow, and then I want to output a how-to guide, and then give it a little bit of extra instruction, just letting it know what I uploaded, so it knows what context it's going to consider.

Manny:

And then I had it iterate through all of the content that I uploaded, like the entirety of all of the docs that I uploaded, applying all of that information to the templates that I had already created, and then also go back to my initial vector database and search for additional information regarding service accounts, in this case, and take all of that known good context that was already public and apply that to the template as well. And the output, in the end, takes about five minutes, but I have a reasonable first draft of content based on whatever I upload, and as long as the content that I provide is good, then it is a starting point. And what I've been able to do is take that and, before I do any interviews with the necessary engineers or whoever else, I send that AI draft to the engineer and say, hey, vet this please, and if there's anything missing, let me know and just leave comments, and then I get their response, and it's cool.

Manny:

Here's a first draft with known good context, and it's marked up by a subject matter expert. Now I can take this and do a full first draft or a full revision, and what that's let me do is, as long as I have that good starting context, I've reduced time to a real first draft from two weeks to two days.

Zohra:

This sounds too good to be true, but you have proved that it is possible. Now, like you said, you are at the cutting edge, you like to be at the cutting edge of technology. For someone who wants to do this but does not have either the resources or the access, or if they want to learn and they are at a beginner level, are there resources that you can point them to, to get started?

Manny:

Yes. So I did this in early 2023. I did it the hard way at the hardest time to do it. Now, many people are familiar with Tom Johnson's blog, I'd Rather Be Writing. He covers many of these subjects in more recent posts that he's put up, and it's not quite the same way that I've gone about doing it, but it's similar enough, and you can use many of these similar strategies in ChatGPT or whatever LLM you want to use. You don't have to get into the nuts and the bolts of all of the engineering stuff. You can use the tools that are available to anybody. But it's still really important that you understand how this all works, because if you end up on a team, like, Docker just put out chat over their docs, AI search or whatever these people are calling it, lots of companies are putting out these sorts of systems, and if you don't understand how your content is being used, then it may be misused in these systems or not deliver good results.

Zohra:

Very important point. I think it's definitely something to keep in mind. The one question that comes to my mind is, as you were taking me through this process is this scalable?

Manny:

That depends on your infrastructure. It also depends on what you mean by scale. If you are talking about, cool, can we use RAG for serving an AI search for our documentation? Yes, 100%, that's scalable. That's what RAG excels at. For that sort of customer-facing content retrieval and summarization, RAG is perfect. There's no other way to do it other than RAG, and, frankly, it's scalable even from a compute and environmental standpoint.

Manny:

One of the things that some people have been saying recently is that, with models that have gigantic context windows, like Claude 3 and Google Gemini Pro, you don't need RAG anymore. Gemini has like a 1 million or 1.5 million token context window, which is something ridiculous. You can fit multiple books in it. The issue is that the more tokens you put into a model's context, the longer it takes to run, so the more compute it takes and the more energy it takes. And so what RAG lets you do is, instead of shoving the entirety of a book or the entirety of a doc set into every single inference, it lets you select only the things that you want, only the things that you need for the given query. So it reduces compute time and reduces compute cost.

Zohra:

Okay, I think that answers my question, and for me, yes, scalability meant all of those things, so thank you for addressing it. The challenge that I can imagine running into in my current role is we are many writers, and some of the projects that we are assigned to overlap. It's an entangled web, and when I think about scalability, I'm thinking about all these products that are intermingled, that intersect each other. How do we parse that information? How do we increase the efficiency of the inference? I don't even know if I'm making sense with my question, but those were the questions that came to mind. That's what I meant by scalability in that context.

Manny:

As far as multiple people making multiple updates across multiple different products and doc sets, and having that scale, as far as usability, you have to figure out how to optimize that. You can do something that's pretty naive, not to use the term negatively. I mean, you could do something like, oh hey, anytime anybody makes an update to any of the doc sets that are a part of this vector database, rerun the whole thing, resplit everything, re-embed everything, reprocess every bit of doc. That's not reasonable. We don't do that with things like databases, and we shouldn't be doing that here. As far as what scalability means in this context, we need something that's a bit more intelligent about it. It's like, oh hey, look, this particular file had an update, so I'm going to reclean, resplit, re-embed this file and update the equivalent chunks in the vector database, instead of rebuilding the entire database from scratch.

Zohra:

Okay, that makes sense. That sounds like a very logical approach. And working in chunks, even outside of an LLM, I think, is critical here, so that's where metadata comes in handy, I suppose. Right?

Manny:

Yeah, because if you know what file a chunk came from, then you can just say, hey look, this file updated. Let's go in and invalidate all of the chunks that originated from that file. And here we go, split and input the new ones and you're good to go.
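
A sketch of that incremental approach, keying the store by source file so an update only invalidates and replaces that file's chunks; the `embed` and `split_into_chunks` helpers are the same kind of hypothetical stand-ins sketched earlier, and the in-memory dictionary stands in for a real vector database.

```python
# vector_store maps each source file to the list of chunk records built from it.
vector_store: dict[str, list[dict]] = {}

def embed(text: str) -> list[float]:
    # Placeholder for a real embedding model call.
    return [float(len(text))]

def split_into_chunks(text: str) -> list[str]:
    # Placeholder splitter; a real one would split on headings and size.
    return [p for p in text.split("\n\n") if p.strip()]

def reindex_file(path: str, new_text: str) -> None:
    # Invalidate every chunk that originated from this file...
    vector_store.pop(path, None)
    # ...then re-split and re-embed only this file, leaving the rest of the database untouched.
    vector_store[path] = [
        {"text": c, "embedding": embed(c), "source": path} for c in split_into_chunks(new_text)
    ]

reindex_file("docs/service-accounts.md",
             "Service accounts\n\nThey authenticate machine-to-machine calls.")
print(len(vector_store["docs/service-accounts.md"]))  # chunks for just the updated file
```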

Zohra:

Yeah, I think this sort of speaks to, or rather dovetails beautifully with, my question, which was going to be, how do we manage version control and updates? So I suppose what you already offered would cover that scenario.

Manny:

Yes, and it depends on what content you want to be live when, and knowing where the vector database is going to be used. For example, if the vector database is going to be used for customers, if customers are going to be interacting with it, then, well, whenever docs go into production, you update the vector database. If it's something that's going to be internally used, like the toolkit that I wrote at Skyflow, then it can be anytime something gets checked in, even if it's not public yet. Cool, we can update the vector database so that people internally are aware of things, even for new features, before they're live.

Zohra:

Now, have you automated that process? Because I'm thinking that rebuilding the vector database at the moment sounds like a manual process. So let's say there's a PR update and the PR gets approved. Can we set up a pipeline where the vector database update is triggered, so there is no manual intervention there? Have you done anything to automate that? If yes, please share with us.

Manny:

I mean, so the way that it works is, I'm not manually creating the vector database at any point in time. I wrote a script that goes and fetches all of the necessary content and does all of the work for me, and so all that has to happen now is, whenever a PR gets merged, it just runs the script again in CI, and it goes and rebuilds the vector database and checks the vector database in wherever it needs to be checked in. And then the next time there's an update to my backend service for the tooling that I created, it has the new vector database available to it.
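
The script itself could be as simple as something like this sketch, run by CI on every merged PR; the directory, output file, and helper functions are all hypothetical, and a real pipeline would swap in genuine embedding and vector database calls.

```python
"""Rebuild the docs vector database. Intended to run in CI whenever a PR merges."""
import json
from pathlib import Path

DOCS_DIR = Path("docs")           # hypothetical source directory
OUTPUT = Path("vector_db.json")   # hypothetical artifact checked in or published

def embed(text: str) -> list[float]:
    # Placeholder for a real embedding model call.
    return [float(len(text))]

def split_into_chunks(text: str) -> list[str]:
    return [p for p in text.split("\n\n") if p.strip()]

def main() -> None:
    records = []
    for path in sorted(DOCS_DIR.rglob("*.md")):
        for chunk in split_into_chunks(path.read_text()):
            records.append({"source": str(path), "text": chunk, "embedding": embed(chunk)})
    OUTPUT.write_text(json.dumps(records))
    print(f"Wrote {len(records)} chunks from {DOCS_DIR} to {OUTPUT}")

if __name__ == "__main__":
    main()
```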

Zohra:

Awesome. Now this begs the question: does this introduce a sense of job insecurity for your team members, because now you've accelerated the content creation process and you've automated many of the review steps that you would require? Or has this opened up opportunities?

Manny:

For me it's opened up opportunities, because it's me and one writer, and there was no way that we were covering everything that needed to be covered, and so this has helped us catch up a good bit. But the biggest thing is that it has helped other people contribute more, because there are a lot of people who didn't feel comfortable writing content to help out, because they weren't aware of our style. They weren't aware of best practices for writing and how to go about structuring content or filling in content in a given structure. And so what this has enabled is PMs, or whoever else who wants to contribute, who wants to see the docs flourish; it's lowered the barrier of entry for them. So I actually have more people helping out creating the docs than I did before, and I have taken on more of the position of content curator. It's like, okay, cool, yeah, you're helping create the content; okay, I'm going to be your editor. I'm going to help steward content into a good shape.

Manny:

I actually do the reviews. We actually have more reviews than we had before, because I need to make sure that the content is up to snuff before it goes out to the public. And to further help with all of this, I use Vale for automated PR style checking and that sort of thing, and that helps my contributors as well, because the computer tells them, oh hey, here's how you fix this style issue, and so they're hearing it from the computer and not from me. So overall, personally, I've found it very helpful. My writer is very excited about our system and likes using it. Other people internally like using it. It's provided more opportunity, not less.

Zohra:

That's awesome, it has. The more you hear about AI, generative AI especially, I guess the common understanding at the moment is that it's going to take our jobs, it's going to replace us. The positive spin is that you're going to free up your time and you get to do more creative stuff. With your example, you've actually made that a reality.

Manny:

Yes, and don't get me wrong, there are still rough edges to smooth out, of course. But yes, I mean, I am one of the people who has had AI make a profound positive impact on my day-to-day work life.

Zohra:

It's good to hear, because, I mean, there's been a lot of hype about it, and everybody wants to jump on that bandwagon and say you can do all these cool things. But it seems like those grand ideas are starting to die down, or people are realizing what the limits of AI are or what the potential is, and that discovery is really starting to happen now. There is that creative conversation that's starting to happen, and what I'm taking away from this conversation is that human intervention is not going to go away.

Manny:

No, absolutely not. Because even with all of the work that I've done, the first draft, the AI draft that it comes up with, is absolutely nowhere near publishable. Even after the experts review it and I get content back, it needs so much help. Like, it's in roughly the right shape, but that's about all I can say for it. I know it has good content and good context and it's the right shape, but it still needs help being readable. It needs help making sure that it conforms to our style, that it fits in with the tone and the structure of the rest of our content. There is so much more that I provide to it. This is just a way of speeding up that first draft, of solving that blank-page problem that had been taking so long. But the real bulk of my work, the meaningful bulk of my work in the revision, the polishing, that hasn't changed one bit.

Zohra:

And I think I would probably also go back to what you touched upon, providing that context to the system. Without that, the AI would have no context at all. So you had to feed that context in. So there is that human intervention that's happening, and without that, the LLM is really, again, useless junk, I would say.

Manny:

Yeah, more or less. I mean, the way I think about it is, AIs, LLMs very specifically, are like untrained but very knowledgeable interns. They don't have a lot of experience, but if you give them very detailed instructions and provide them absolutely all of the necessary context to complete the task at hand, then they can do it. But you have to give them all of that up front. It's not magic. You can't just say, oh hey, go do this thing, and not provide guardrails or requirements, and expect it to return what you want. You have to put in the work. But if you put in the work, if you structure it so it's repeatable, then it's just really advanced automation.

Zohra:

Fantastic. We have covered a ton of good questions. I've actually run out of my questions, and I'm really getting excited about the demo. So, before we dive into the demo, is there anything that we've missed that you want to add?

Manny:

No, no, I think we've mostly touched on it. Just to reiterate that there's a lot of tooling out there. There are a lot of good resources out there, but they augment, they don't replace, and you don't have to use any of this. But it can make you a lot faster if you work it into how you do your work and it can improve a lot of the processes if you take the time and a bit of investment up front to figure out how it all dovetails together for you and your particular use cases. Please don't get as into the weeds as I did with the engineering stuff. There are far easier ways to do it now than there were then. But again, just know your value and advocate for yourself when other people are using your content.

Zohra:

I love the optimism in your suggestion, but I also want to say maybe you should provide some coaching for the non-technical amongst us, because I think it would really benefit us. You know, just going and trying to glean all that information online and trying to do this on your own... having somebody to coach you through it would be awesome. And I know you're presenting at several conferences, and it's my bad luck that I don't get to attend. However, this is excellent. I really feel very energized by this conversation, but let's dive into the demo.

Manny:

So this is VerbaGPT. Verba is my writing team's name; Verba is Latin for words, and GPT because, why not? Everything's GPT nowadays.

Manny:

So what we see here is my interface that I wrote in Streamlit, and right now we're just doing chat, and we could do something like, what is Skyflow? I do this because you can see that the response is going to be knowledgeable about Skyflow itself. So it's going to be a different response than you might get from other models that haven't included Skyflow, or as much about Skyflow, in their training datasets. And so here we can see, oh hey, Skyflow is a data privacy platform, does all sorts of stuff, but we have SDKs including Android, iOS, yada yada yada. That list is accurate. It's not hallucinated. We can indeed help with GDPR, HIPAA, PCI DSS, and all of this is informed by what Skyflow can do, because it's pulled from our documentation. Awesome. And then you can continue asking questions about it, like what is a service account, and so on, and it's going to do the same thing on the back end. What's happening? This is my orchestration layer, and it's going and taking my prompt and then querying my vector database via RAG, and that is what is supplying the chunks of context that are then being passed to the LLM, and the LLM is considering all of that when it creates the response, which is output here.

Manny:

And we can do more. We can come over here, and I'm going to grab a file. So what I'm taking here is Facebook's paper on retrieval-augmented generation. Facebook's the one that came out with it. This is the paper that announced the strategy, and it's available online. And so what I'm going to do is I'm going to upload it here, and then I'm going to open up a peek under the hood to show you what chunking looks like, to show you how this PDF is chunked.

Manny:

So give me just a moment. I'm going to upload this file and, ta-da, it's uploaded. I'm not going to infer against it, because it's a really big document and that would take forever, but just to show you under the hood what this looks like: we've got custom text, and so we split it out here, and all of these items here, let me find a good one here, they are chunks of text taken from the PDF, and you've got the page content itself: "We use splits from ... to evaluate RAG's generation abilities in a non-QA setting." What it did is it took the PDF, it extracted the text from it, and then it took that extracted text and split it into all of these chunks. But if you look, there's the raw page content itself, but there's also all of this metadata, right?

Manny:

So we've got the source, which is, hey look, the PDF that I just uploaded, and then it has coordinates as to where in the page it got this chunk from. The file name that exists as the source can also be a URL in certain cases, so that the orchestration layer can know, oh hey, this is where this chunk came from, and it can provide citations, that sort of thing. And so there's a lot that you can do based on the source content, lots of different kinds of metadata that we've talked about, but just from a random PDF, this is the metadata that it extracts. This is the kind of chunk that would be passed into the LLM, in about this format, so that it would know what to work with to augment its existing knowledge.

Zohra:

Awesome. So now that you've uploaded this PDF and it has been chunked, can we query and see what it outputs?

Manny:

We can, but, you know, let's find out how long this takes. I'm not sure we have enough time for it right now. PDFs are particularly difficult, but we'll see how it goes. What is RAG? So it's thinking, generating the response, and right now what's happening is it just got a whole bunch of chunks and it sent them off to the LLM. But because there are going to be a lot of hits, there are going to be a lot of matches, it's taking an extra long time to run inference, because there are more tokens involved than there were before.

Zohra:

Makes sense. I think it was important for me to understand what's going on with the new data that just got uploaded. This is what you meant by inference.

Manny:

Yes, so inference is the process of asking the LLM for a response. But while we're here, let's see if this other part works. So now I'm coming over and I'm switching into draft mode, and, using the same uploaded context that we already had, I'm going to come in and say, hey, give me a conceptual overview, no additional instructions, and go. And so now what it's doing is it's iterating through all of the text that it extracted from the uploaded document, and it is applying it to the conceptual overview template that I've defined from the Good Docs Project, like I mentioned before.

Zohra:

Yeah.

Manny:

And this is a little bit different than traditional RAG, because instead of selecting just a couple of results, it's literally iterating through the entirety of the document. But the point stands: it's still split into chunks, it's processing one chunk at a time, applying it to the template, and here we go.

Zohra:

That's your first draft.

Manny:

This is my first draft: "Understanding Retrieval-Augmented Language Models."

Zohra:

Brilliant.

Manny:

And so here we are. We've got an overview subheading: by the end of this article, readers should understand key concepts. And then it goes into each of the necessary subheadings: retriever versus generator, which is, effectively, the retriever is your vector database and the generator is the LLM. And then training and inference, and various use cases. And we did it live.

Zohra:

Fantastic. This is great, Manny. Thank you so much for walking us through what you've built.

Manny:

My pleasure.

Zohra:

To see this in action is definitely very rewarding, actually.

Manny:

Fantastic. It's rewarding to be able to show it off.

Zohra:

Well, it's well earned, why not? Manny, thank you so much for the demo and walking us through what RAG is. An hour later, I feel like, okay, I understand and I can speak a little intelligently about RAG now.

Manny:

Thank you for educating me.

Zohra:

Thank you for sharing your knowledge with us and bringing these creative ways of how technical communicators can be advocates in their roles. I think what I've been saying all along is to push the envelope, to try and be creative, and you're speaking to that.

Manny:

I definitely exhibit my creativity through tinkering, and I'm glad that that could be of use to other folks. So, you know, everybody knows where to find me. I'm available on LinkedIn. I'm friendly, I promise. Feel free to reach out if you have any questions.

Zohra:

Subscribe to the podcast on your favorite app, such as Apple, Spotify, or YouTube Music. For the latest on my show, follow me on LinkedIn or visit me at www.insidetechcomm.show. Catch you soon on another episode.

Chapter Markers

  • RAG Framework and Vector Databases
  • Optimizing Content for LLM Systems
  • Advanced Content Creation Automation Techniques
  • Content Creation and Scalability Strategies
  • AI and Human Intervention in Writing
  • Exploring Creativity in Technical Communication