Piping AudioKit Microphone to Google Speech-to-Text

I'm trying to get AudioKit to pipe the microphone to Google's Speech-to-Text API as seen here, but I'm not entirely sure how to go about it.

To prepare the audio for the Speech-to-Text engine, you need to set up the encoding and pass it through in chunks. In the example Google provides, they use Apple's AVFoundation, but I'd like to use AudioKit so I can perform some pre-processing, such as cutting off low amplitudes.
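
For illustration, the sort of pre-processing I have in mind is a naive noise gate along these lines (gateLowAmplitudes and the threshold value are my own placeholders, not AudioKit API):

import AVFoundation

// Zero out samples whose amplitude falls below a threshold (a naive
// noise gate). The threshold of 0.02 is an arbitrary illustrative value.
func gateLowAmplitudes(_ buffer: AVAudioPCMBuffer, threshold: Float = 0.02) {
    guard let channels = buffer.floatChannelData else { return }
    let frameCount = Int(buffer.frameLength)
    for channel in 0..<Int(buffer.format.channelCount) {
        let samples = channels[channel]
        for i in 0..<frameCount where abs(samples[i]) < threshold {
            samples[i] = 0
        }
    }
}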

I believe the right way to do this is to use a Tap:

First, I should match the format by:

var asbd = AudioStreamBasicDescription()
asbd.mSampleRate = 16000.0
asbd.mFormatID = kAudioFormatLinearPCM
asbd.mFormatFlags = kAudioFormatFlagIsSignedInteger | kAudioFormatFlagIsPacked
asbd.mBytesPerPacket = 2
asbd.mFramesPerPacket = 1
asbd.mBytesPerFrame = 2
asbd.mChannelsPerFrame = 1
asbd.mBitsPerChannel = 16

AudioKit.format = AVAudioFormat(streamDescription: &asbd)!
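
As an aside, I believe the same 16 kHz, mono, 16-bit integer format can be built with AVAudioFormat's convenience initializer instead of filling in the ASBD by hand:

// Equivalent format via the convenience initializer
// (16 kHz, 1 channel, 16-bit signed integer, interleaved):
AudioKit.format = AVAudioFormat(commonFormat: .pcmFormatInt16,
                                sampleRate: 16_000,
                                channels: 1,
                                interleaved: true)!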

Then create a tap such as:

open class TestTap {
    internal let bufferSize: UInt32 = 1_024

    @objc public init(_ input: AKNode?) {
        input?.avAudioNode.installTap(onBus: 0, bufferSize: bufferSize, format: AudioKit.format) { buffer, _ in
            // do work here
        }
    }
}

But I wasn't able to work out the right way to hand this data to the Google Speech-to-Text API's streamAudioData method in real time with AudioKit. Perhaps I'm going about this the wrong way?

UPDATE:

I've created a Tap like this:

open class TestTap {

    internal var audioData = NSMutableData()
    internal let bufferSize: UInt32 = 1_024

    func toData(buffer: AVAudioPCMBuffer) -> NSData {
        let channelCount = 2 // given PCMBuffer channel count is 2
        let channels = UnsafeBufferPointer(start: buffer.floatChannelData, count: channelCount)
        return NSData(bytes: channels[0], length: Int(buffer.frameCapacity * buffer.format.streamDescription.pointee.mBytesPerFrame))
    }

    @objc public init(_ input: AKNode?) {

        input?.avAudioNode.installTap(onBus: 0, bufferSize: bufferSize, format: AudioKit.format) { buffer, _ in
            self.audioData.append(self.toData(buffer: buffer) as Data)

            // We recommend sending samples in 100ms chunks (from Google)
            let chunkSize: Int /* bytes/chunk */ = Int(0.1 /* seconds/chunk */
                * AudioKit.format.sampleRate /* samples/second */
                * 2 /* bytes/sample */ )

            if self.audioData.length > chunkSize {
                SpeechRecognitionService
                    .sharedInstance
                    .streamAudioData(self.audioData,
                                     completion: { response, error in
                                        if let error = error {
                                            print("ERROR: \(error.localizedDescription)")
                                            SpeechRecognitionService.sharedInstance.stopStreaming()
                                        } else if let response = response {
                                            print(response)
                                        }
                    })
                self.audioData = NSMutableData()
            }

        }
    }
}
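
(For reference, at 16 kHz mono with 2 bytes per sample, that chunk size works out to 0.1 × 16,000 × 2 = 3,200 bytes, i.e. 1,600 samples per 100 ms chunk.)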

and in viewDidLoad, I'm setting AudioKit up with:

AKSettings.sampleRate = 16_000
AKSettings.bufferLength = .shortest

However, Google complains with:

ERROR: Audio data is being streamed too fast. Please stream audio data approximately at real time.

I've tried changing multiple parameters, such as the chunk size, to no avail.

1 Answer

I found the solution here. As far as I can tell, the stream was running faster than real time because the tap was capturing at the hardware sample rate (44.1 kHz) while the bytes were declared to Google as 16 kHz, so converting each buffer down to 16 kHz with AVAudioConverter before streaming fixes the pacing.

Final code for my Tap is:

open class GoogleSpeechToTextStreamingTap {

    internal var converter: AVAudioConverter!

    @objc public init(_ input: AKNode?, sampleRate: Double = 16000.0) {

        let format = AVAudioFormat(commonFormat: .pcmFormatInt16, sampleRate: sampleRate, channels: 1, interleaved: false)!

        self.converter = AVAudioConverter(from: AudioKit.format, to: format)
        self.converter?.sampleRateConverterAlgorithm = AVSampleRateConverterAlgorithm_Normal
        self.converter?.sampleRateConverterQuality = .max

        let sampleRateRatio = AKSettings.sampleRate / sampleRate
        let inputBufferSize = 4410 // 100ms of 44.1K = 4410 samples.

        input?.avAudioNode.installTap(onBus: 0, bufferSize: AVAudioFrameCount(inputBufferSize), format: nil) { buffer, time in

            let capacity = Int(Double(buffer.frameCapacity) / sampleRateRatio)
            let bufferPCM16 = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: AVAudioFrameCount(capacity))!

            var error: NSError? = nil
            self.converter?.convert(to: bufferPCM16, error: &error) { inNumPackets, outStatus in
                outStatus.pointee = AVAudioConverterInputStatus.haveData
                return buffer
            }

            let channel = UnsafeBufferPointer(start: bufferPCM16.int16ChannelData!, count: 1)
            let data = Data(bytes: channel[0], count: capacity * 2)

            SpeechRecognitionService
                .sharedInstance
                .streamAudioData(data,
                                 completion: { response, error in
                                    if let error = error {
                                        print("ERROR: \(error.localizedDescription)")
                                        SpeechRecognitionService.sharedInstance.stopStreaming()
                                    } else if let response = response {
                                        print(response)
                                    }
                })
        }
    }
}
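
Wiring this up might look roughly like the following sketch (assuming AudioKit 4, where AKMicrophone() may return an optional depending on the version; the zero-gain AKBooster is just one way to keep the mic rendering without monitoring it):

let mic = AKMicrophone()
let tap = GoogleSpeechToTextStreamingTap(mic)
AudioKit.output = AKBooster(mic, gain: 0) // keep the mic in the chain, silence monitoring
try AudioKit.start()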
