Audio Kit

Audio Kit Overview

Audio Kit is a set of tools built with Claude and the Web Audio API to streamline model evaluation and testing. I built them while working on Adobe Podcast to automate time-consuming tasks such as audio mixing and speech-to-text transcript review.

Track & Stem Mixing Tools

I built tools to simplify audio comparison and remixing. One tool generates an alternating preview of original and enhanced tracks, making it easier to assess improvements without manually switching between files. Another allows real-time mixing of isolated speech, background, and reverb tracks—enabling experimentation without upfront engineering investment.

Speech-to-Text Comparison Tools

Manually reviewing speech-to-text output is tedious and time-intensive. To speed up this step, I used Claude to build a transcript comparison tool that shows two transcript versions side by side and highlights key differences, including repeated words, filler words, numeric data, and other attributes. This makes it easier to spot discrepancies in problematic content types and to refine model accuracy.

Audio Track Mixing

Comparing original and enhanced audio normally means switching between files by hand, which makes it difficult to assess improvements. This tool automates the process by generating a single track with alternating segments of original and enhanced audio, so I can hear the difference in one pass. I used Claude to build it with the Web Audio API for audio manipulation (including AudioContext for real-time processing) and the MediaRecorder API for output generation.

This tool also let us generate quick before-and-after samples for marketing, customer support, and other situations where we had an opportunity to showcase how the technology we were building could improve existing audio recordings.
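The alternating preview described above can be sketched as a pure function over raw samples. This is a minimal illustration, not the tool's actual code: the function name `alternateSegments` and its parameters are hypothetical, and it assumes both tracks are mono, time-aligned, and share a sample rate. In a browser, the sample arrays could come from `AudioBuffer.getChannelData(0)` after decoding each file with `AudioContext.decodeAudioData`.

```typescript
// Build an A/B preview by alternating fixed-length segments of two tracks.
// Assumption: both inputs are mono, time-aligned, and use the same sample rate.
function alternateSegments(
  original: Float32Array,
  enhanced: Float32Array,
  sampleRate: number,
  segmentSeconds: number,
): Float32Array {
  const segmentLength = Math.max(1, Math.round(sampleRate * segmentSeconds));
  // Stop at the shorter track so every output sample has a source in both.
  const length = Math.min(original.length, enhanced.length);
  const output = new Float32Array(length);

  for (let start = 0; start < length; start += segmentLength) {
    const end = Math.min(start + segmentLength, length);
    // Even-numbered segments come from the original, odd-numbered from the enhanced track.
    const useOriginal = Math.floor(start / segmentLength) % 2 === 0;
    const source = useOriginal ? original : enhanced;
    output.set(source.subarray(start, end), start);
  }
  return output;
}
```

Because the preview keeps the original timeline, the listener hears the same passage continue across each switch. Hard cuts like these can click at segment boundaries, so a short crossfade may be worth adding in practice.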

Stem Mixing

As the product evolved, we experimented with different ways to let users direct or control how their audio is enhanced. One workflow let the user specify how much of the background they want in the enhanced audio. Typically, building a workflow like this requires upfront engineering work to prototype the experience: how to surface the controls, which defaults to lead with, and what experience best meets the user's needs. Using Claude, I built a mixing tool that takes the isolated tracks and lets me adjust the volume of each, to evaluate what the right defaults should be and which of the isolated tracks most influence the output.
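The core of a mixer like this is a weighted sum of the isolated tracks. The sketch below assumes three mono stems (speech, background, reverb, as named in the overview) and one linear gain per stem. The `Stems` and `StemGains` types and the `mixStems` function are hypothetical names for illustration. In a browser, the same idea would more likely be expressed as one `GainNode` per stem connected to `AudioContext.destination`, so that slider changes are heard in real time.

```typescript
// Hypothetical shape for the three isolated stems.
interface Stems {
  speech: Float32Array;
  background: Float32Array;
  reverb: Float32Array;
}

// Linear gain per stem: 0 mutes the stem, 1 leaves it unchanged.
interface StemGains {
  speech: number;
  background: number;
  reverb: number;
}

function mixStems(stems: Stems, gains: StemGains): Float32Array {
  const length = Math.min(
    stems.speech.length,
    stems.background.length,
    stems.reverb.length,
  );
  const mix = new Float32Array(length);
  for (let i = 0; i < length; i++) {
    const sample =
      stems.speech[i] * gains.speech +
      stems.background[i] * gains.background +
      stems.reverb[i] * gains.reverb;
    // Clamp to [-1, 1] so summed stems cannot exceed full scale.
    mix[i] = Math.max(-1, Math.min(1, sample));
  }
  return mix;
}
```

Rendering the same recording with several gain settings and listening to the results is one way to compare candidate defaults.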

I also added presets to this tool so we could save slider configurations, since we learned that different recordings in different scenarios often benefit from a different mix of background and reverb. Again, Claude made these easy to build, enabling early prototyping of concepts and possible solutions with minimal engineering investment.
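Presets of this kind can be modeled as named gain configurations. The following is a sketch under assumptions: the preset names, the gain values, and the `savePreset`, `loadPreset`, and `exportPresets` helpers are all hypothetical and are not taken from the actual tool.

```typescript
interface StemGains {
  speech: number;
  background: number;
  reverb: number;
}

// Named slider configurations. The "default" values here are placeholders.
const presets: Record<string, StemGains> = {
  default: { speech: 1, background: 0.3, reverb: 0.1 },
};

function savePreset(name: string, gains: StemGains): void {
  // Store a copy so later slider changes do not mutate the saved preset.
  presets[name] = { ...gains };
}

function loadPreset(name: string): StemGains {
  // Fall back to the default preset when the name is unknown.
  const preset = presets[name] ?? presets["default"];
  return { ...preset };
}

// Serialize all presets, for example to keep them in browser localStorage.
function exportPresets(): string {
  return JSON.stringify(presets);
}
```

Keeping presets as plain data makes it straightforward to share a configuration that worked well for one recording scenario and try it on another.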

Speech-to-Text Comparison

One of the most time-consuming steps in model evaluation was reviewing long speech-to-text transcripts, often looking for small details and checking accuracy while listening to the original audio. Traditional PDF tools are not optimized for this kind of review, so it was much harder to search for the particular terms or content types we were evaluating.

Using Claude, I built a transcript comparison tool that surfaces diffs and also focuses on specific content types such as abbreviations, filler words, repeated words, and numeric data. This made it easy to review model iterations where we wanted to evaluate how a model transcribed particular words or sentences. For example, in one version of the model, if a person repeated a word, the model transcribed it only once, which resulted in misplaced timecodes. The tool also made it easy to evaluate how a model transcribed numeric content. For example, we tracked how the transcription of the phrase "Seven bucks" evolved from "$7" to the literal "7 bucks" and determined which was ideal for the use case we were solving.
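The content-type highlighting described above can be approximated by tagging each word of a transcript and comparing tag counts between two versions. This is a rough sketch, not the tool's implementation: the filler-word list, the tagging rules, and the function names are assumptions chosen for illustration.

```typescript
type Tag = "filler" | "repeated" | "numeric" | "abbreviation";

interface TaggedToken {
  index: number;
  text: string;
  tags: Tag[];
}

// Hypothetical filler list; a real review would tune this per language and model.
const FILLER_WORDS = new Set(["um", "uh", "er", "hmm"]);

// Lowercase and strip surrounding punctuation, keeping a leading "$".
function normalize(word: string): string {
  return word.toLowerCase().replace(/^[^\w$]+|[^\w]+$/g, "");
}

function tagTranscript(transcript: string): TaggedToken[] {
  const words = transcript.split(/\s+/).filter((w) => w.length > 0);
  return words.map((text, index) => {
    const word = normalize(text);
    const tags: Tag[] = [];
    if (FILLER_WORDS.has(word)) tags.push("filler");
    // A word is "repeated" when it matches the word immediately before it.
    if (index > 0 && word !== "" && word === normalize(words[index - 1])) {
      tags.push("repeated");
    }
    // Only digits are detected; spelled-out numbers would need a word list.
    if (/\d/.test(word)) tags.push("numeric");
    // Treat runs of two or more capital letters as abbreviations.
    if (/^[A-Z]{2,}$/.test(text.replace(/[^\w]/g, ""))) tags.push("abbreviation");
    return { index, text, tags };
  });
}

function countTags(tokens: TaggedToken[]): Record<Tag, number> {
  const counts: Record<Tag, number> = { filler: 0, repeated: 0, numeric: 0, abbreviation: 0 };
  for (const token of tokens) {
    for (const tag of token.tags) counts[tag] += 1;
  }
  return counts;
}

// Per-tag change in counts from version A to version B of a transcript.
function tagCountDelta(a: string, b: string): Record<Tag, number> {
  const ca = countTags(tagTranscript(a));
  const cb = countTags(tagTranscript(b));
  return {
    filler: cb.filler - ca.filler,
    repeated: cb.repeated - ca.repeated,
    numeric: cb.numeric - ca.numeric,
    abbreviation: cb.abbreviation - ca.abbreviation,
  };
}
```

A negative `repeated` delta between two model versions is the kind of signal that would flag the dropped-repetition behavior mentioned above, while the per-token tags give a UI something to highlight in a side-by-side view.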