Azure Speech Websocket Protocol

Posted by a guest, Jan 9th, 2019
  1. ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
  2. Step 1: Open a socket connection to wss://{region}.stt.speech.microsoft.com
  3. ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
  4.  
  5. {region} is the region your subscription key was issued for, as listed in this table: https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/regions
  6.  
  7. ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
  8. Step 2: Send a websocket connection request.
  9. ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
  10.  
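  11. {locale} is a locale such as "en-us". Authenticate with either your raw subscription key or a bearer token retrieved from the Cognitive Services token endpoint https://docs.microsoft.com/en-us/azure/cognitive-services/authentication#authenticate-with-an-authentication-token (which I believe is the primary supported path). Send exactly one of the two headers marked {PICK ONE} below. X-ConnectionId is a random GUID written as 32 hex digits without dashes. {nonce} is the base64-encoded random 16-byte value that RFC 6455 requires for Sec-WebSocket-Key.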
  12.  
  13. GET /speech/recognition/interactive/cognitiveservices/v1?format=detailed&language={locale} HTTP/1.1\r\n
  14. Connection: Upgrade\r\n
  15. Upgrade: websocket\r\n
  16. {PICK ONE} Ocp-Apim-Subscription-Key: {API_auth_key}\r\n
  17. {PICK ONE} Authorization: Bearer {token}\r\n
  18. X-ConnectionId: 0c623356f7924352aee9612a531c4a19\r\n
  19. Sec-WebSocket-Key: {nonce}\r\n
  20. Sec-WebSocket-Version: 13\r\n
  21. Sec-WebSocket-Protocol: USP\r\n
  22. Host: {region}.stt.speech.microsoft.com\r\n
  23. \r\n
  24.  
  25. The server should reply with "HTTP/1.1 101 Switching Protocols" followed by a list of headers terminated by \r\n\r\n. From this point on, everything travels as WebSocket frames.
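
A minimal sketch of the Step 2 request, assuming you are writing the handshake by hand rather than using a WebSocket library (most libraries build these headers for you). The function names build_handshake and expected_accept are hypothetical. The Sec-WebSocket-Accept check follows RFC 6455 and is not specific to this service.

```python
import base64
import hashlib
import os
import uuid

# Fixed GUID defined by RFC 6455 for computing Sec-WebSocket-Accept.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def build_handshake(region, locale, subscription_key=None, token=None):
    """Return (request_bytes, nonce) for the upgrade request shown in Step 2."""
    if (subscription_key is None) == (token is None):
        raise ValueError("pass exactly one of subscription_key or token")
    # Sec-WebSocket-Key is 16 random bytes, base64-encoded.
    nonce = base64.b64encode(os.urandom(16)).decode("ascii")
    if subscription_key is not None:
        auth = f"Ocp-Apim-Subscription-Key: {subscription_key}"
    else:
        auth = f"Authorization: Bearer {token}"
    lines = [
        "GET /speech/recognition/interactive/cognitiveservices/v1"
        f"?format=detailed&language={locale} HTTP/1.1",
        "Connection: Upgrade",
        "Upgrade: websocket",
        auth,
        f"X-ConnectionId: {uuid.uuid4().hex}",  # 32 hex digits, no dashes
        f"Sec-WebSocket-Key: {nonce}",
        "Sec-WebSocket-Version: 13",
        "Sec-WebSocket-Protocol: USP",
        f"Host: {region}.stt.speech.microsoft.com",
        "",
        "",  # the two empty entries produce the final \r\n\r\n
    ]
    return "\r\n".join(lines).encode("ascii"), nonce


def expected_accept(nonce):
    """Value the server must return in Sec-WebSocket-Accept for this nonce."""
    digest = hashlib.sha1((nonce + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")
```

Comparing expected_accept(nonce) with the Sec-WebSocket-Accept header in the 101 response confirms that the server actually completed a WebSocket handshake.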
  26.  
  27. ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
  28. Step 3: Send the context
  29. ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
  30.  
  31. Now, using the WebSocket data framing from RFC 6455 (https://tools.ietf.org/html/rfc6455#section-6.1), send a TEXT frame with this content:
  32. X-Timestamp:2018-12-25T11:33:12.534Z\r\n
  33. Path:speech.config\r\n
  34. Content-Type:application/json\r\n
  35. \r\n
  36. {"context":{"os":{"name":"Client","platform":"Windows","version":"8"},"system":{"build":"Windows-x64","lang":"C#","name":"SpeechSDK","version":"1.2.0"}}}
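
A small sketch of how the speech.config text message above could be assembled. The helper names usp_timestamp and build_speech_config are hypothetical, and the millisecond-precision UTC timestamp format is inferred from the example value rather than from any specification.

```python
import json
from datetime import datetime, timezone


def usp_timestamp(now=None):
    """Format a UTC time like 2018-12-25T11:33:12.534Z (millisecond precision)."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return now.strftime("%Y-%m-%dT%H:%M:%S.") + f"{now.microsecond // 1000:03d}Z"


def build_speech_config(context, now=None):
    """Return the TEXT frame payload: headers, a blank line, then the JSON body."""
    headers = [
        f"X-Timestamp:{usp_timestamp(now)}",
        "Path:speech.config",
        "Content-Type:application/json",
    ]
    body = json.dumps(context, separators=(",", ":"))
    return "\r\n".join(headers) + "\r\n\r\n" + body
```

The returned string is what goes into a single TEXT frame; the context argument would be the {"context": ...} object shown above.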
  37.  
  38. Messages uploaded after this point use BINARY frames and a slightly different encoding. The frame payload is divided into headers and application data. The first 2 bytes of the payload give the length of the headers in bytes as an unsigned 16-bit value in big-endian (network) byte order. The headers follow those two bytes, and everything after the headers is application data.
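
The length-prefixed layout described above can be sketched as follows. The function names are hypothetical, and the code assumes each header is written as Name:Value followed by \r\n, as in the examples in this paste.

```python
import struct


def encode_binary_message(headers, body=b""):
    """Pack headers and body as: 2-byte big-endian header length, headers, body."""
    header_block = "".join(f"{name}:{value}\r\n" for name, value in headers.items())
    header_bytes = header_block.encode("ascii")
    if len(header_bytes) > 0xFFFF:
        raise ValueError("headers do not fit in a 16-bit length field")
    return struct.pack(">H", len(header_bytes)) + header_bytes + body


def decode_binary_message(payload):
    """Split a payload back into (headers dict, body bytes)."""
    if len(payload) < 2:
        raise ValueError("payload too short for the length prefix")
    (header_len,) = struct.unpack_from(">H", payload, 0)
    if len(payload) < 2 + header_len:
        raise ValueError("payload shorter than the declared header length")
    header_text = payload[2:2 + header_len].decode("ascii")
    headers = {}
    for line in header_text.split("\r\n"):
        if line:
            # Split on the first colon only: timestamps contain colons too.
            name, _, value = line.partition(":")
            headers[name] = value.strip()
    return headers, payload[2 + header_len:]
```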
  39.  
  40. ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
  41. Step 4: Send the RIFF header
  42. ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
  43.  
  44. Using the encoding scheme above, send this packet. {request ID} is a guid like 049eb5a35f894aed9b0b6edb4b573320 and remains constant for the rest of the session.
  45.  
  46. (2 byte header length field)
  47. (HEADERS:)
  48. X-Timestamp:2018-12-25T11:33:12.534Z\r\n
  49. Path:audio\r\n
  50. X-StreamId:1\r\n
  51. X-RequestId:{request ID}\r\n
  52. (PAYLOAD:)
  53. {RIFF data}
  54.  
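  55. This is the RIFF data. Decoded field by field, it is a 44-byte WAV header for 16 kHz, 16-bit, mono PCM (format tag 1, 32000 bytes per second, block align 2) with both the RIFF and data lengths set to 0: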
  56. [ 0x52, 0x49, 0x46, 0x46, 0x00, 0x00, 0x00, 0x00, 0x57, 0x41, 0x56, 0x45, 0x66, 0x6D, 0x74, 0x20, 0x10, 0x00, 0x00, 0x00, 0x01, 0x00, 0x01, 0x00, 0x80, 0x3E, 0x00, 0x00, 0x00, 0x7D, 0x00, 0x00, 0x02, 0x00, 0x10, 0x00, 0x64, 0x61, 0x74, 0x61, 0x00, 0x00, 0x00, 0x00 ]
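
As a cross-check, the same 44 bytes can be generated from the format parameters. This is a sketch using the standard WAV header layout; the function name riff_header is hypothetical, and the zero length fields simply mirror the bytes listed above.

```python
import struct


def riff_header(sample_rate=16000, channels=1, bits_per_sample=16):
    """Build a 44-byte PCM WAV header whose length fields are left at 0."""
    block_align = channels * bits_per_sample // 8
    byte_rate = sample_rate * block_align
    fmt_chunk = struct.pack(
        "<IHHIIHH",
        16,               # size of the fmt chunk body
        1,                # format tag 1 = PCM
        channels,
        sample_rate,
        byte_rate,
        block_align,
        bits_per_sample,
    )
    return (
        b"RIFF" + struct.pack("<I", 0) + b"WAVE"
        + b"fmt " + fmt_chunk
        + b"data" + struct.pack("<I", 0)
    )
```

Calling riff_header() with the defaults reproduces the byte array above exactly.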
  57.  
  58. ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
  59. Step 5: Start sending audio
  60. ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
  61.  
  62. Each audio message looks something like this. The audio itself must match the RIFF header you sent before: 16 kHz, 16-bit, mono PCM.
  63.  
  64. (2 byte header length field)
  65. (HEADERS:)
  66. X-Timestamp:2018-12-25T11:33:12.534Z\r\n
  67. Path:audio\r\n
  68. X-StreamId:1\r\n
  69. X-RequestId:{request ID}\r\n
  70. (PAYLOAD:)
  71. {Raw PCM data. The SDK sends 3200 bytes at a time (100 ms of audio at this format), but the size is probably flexible}
  72.  
  73. To close the stream, send one more audio message with the same headers and an empty payload.
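
A sketch of Step 5 under the assumptions above: raw PCM is cut into 3200-byte chunks, each chunk is wrapped in the length-prefixed audio message, and a final message with no audio marks the end of the stream. The names audio_message and audio_messages are hypothetical, and the RIFF header from Step 4 is expected to have been sent already.

```python
import struct
from datetime import datetime, timezone

CHUNK_SIZE = 3200  # bytes per message: 100 ms at 16 kHz, 16-bit, mono


def audio_message(request_id, pcm_chunk):
    """Wrap one chunk of PCM in the binary audio message format."""
    now = datetime.now(timezone.utc)
    timestamp = now.strftime("%Y-%m-%dT%H:%M:%S.") + f"{now.microsecond // 1000:03d}Z"
    headers = (
        f"X-Timestamp:{timestamp}\r\n"
        "Path:audio\r\n"
        "X-StreamId:1\r\n"
        f"X-RequestId:{request_id}\r\n"
    ).encode("ascii")
    return struct.pack(">H", len(headers)) + headers + pcm_chunk


def audio_messages(request_id, pcm, chunk_size=CHUNK_SIZE):
    """Yield one message per chunk, then an empty message to close the stream."""
    for start in range(0, len(pcm), chunk_size):
        yield audio_message(request_id, pcm[start:start + chunk_size])
    yield audio_message(request_id, b"")
```

Each yielded value is the payload of one BINARY frame. When streaming from a microphone you would likely pace the sends at roughly real time rather than pushing the whole buffer at once.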
  74.  
  75. ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
  76. Step 6: Parse responses
  77. ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
  78.  
  79. Data should start coming back from the service at this point. The service sends several kinds of messages, all of them using the length-prefixed encoding described above. The Path header identifies the message type.
  80.  
  81. Turn start:
  82.  
  83. (2 byte header length)
  84. (HEADERS:)
  85. X-RequestId:30eb92c0fde244988d585879189beba6\r\n
  86. Content-Type:application/json; charset=utf-8\r\n
  87. Path:turn.start\r\n
  88. (PAYLOAD:)
  89. {"context": {"serviceTag": "0f48c52ae3ed46f987a3204c5579a26f"}}
  90.  
  91. Speech start:
  92.  
  93. (2 byte header length)
  94. (HEADERS:)
  95. X-RequestId:30eb92c0fde244988d585879189beba6\r\n
  96. Content-Type:application/json; charset=utf-8\r\n
  97. Path:speech.startDetected\r\n
  98. (PAYLOAD:)
  99. {"Offset":6500000}
  100.  
  101. Speech hypothesis:
  102.  
  103. (2 byte header length)
  104. (HEADERS:)
  105. X-RequestId:30eb92c0fde244988d585879189beba6\r\n
  106. Content-Type:application/json; charset=utf-8\r\n
  107. Path:speech.hypothesis\r\n
  108. (PAYLOAD:)
  109. {"Text":"this is","Offset":6500000,"Duration":4700000} // Offset and Duration appear to be C# ticks: 100 ns units, 10,000,000 per second
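
If Offset and Duration really are 100 ns ticks, converting them to seconds is a single division. A tiny sketch, with the hypothetical name ticks_to_seconds:

```python
TICKS_PER_SECOND = 10_000_000  # one tick = 100 ns, assuming C#-style ticks


def ticks_to_seconds(ticks):
    """Convert an Offset or Duration value to seconds."""
    return ticks / TICKS_PER_SECOND
```

For the hypothesis above, this puts the start of speech at 0.65 s into the audio with a duration of 0.47 s.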
  110.  
  111. Speech end:
  112.  
  113. (2 byte header length)
  114. (HEADERS:)
  115. X-RequestId:30eb92c0fde244988d585879189beba6\r\n
  116. Content-Type:application/json; charset=utf-8\r\n
  117. Path:speech.endDetected\r\n
  118. (PAYLOAD:)
  119. {"Offset":27400000}
  120.  
  121. Final speech hypothesis:
  122.  
  123. (2 byte header length)
  124. (HEADERS:)
  125. X-RequestId:30eb92c0fde244988d585879189beba6\r\n
  126. Content-Type:application/json; charset=utf-8\r\n
  127. Path:speech.phrase\r\n
  128. (PAYLOAD:)
  129. {"RecognitionStatus":"Success","Offset":6500000,"Duration":20900000,"NBest":[{"Confidence":0.87446707487106323,"Lexical":"this is a test","ITN":"this is a test","MaskedITN":"this is a test","Display":"This is a test."}]}
  130.  
  131. Turn end:
  132.  
  133. (2 byte header length)
  134. (HEADERS:)
  135. X-RequestId:30eb92c0fde244988d585879189beba6\r\n
  136. Content-Type:application/json; charset=utf-8\r\n
  137. Path:turn.end\r\n
  138. (PAYLOAD:)
  139. {}
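
To tie the response messages together, here is a sketch of a loop that consumes one turn and keeps the final results. It assumes the messages have already been split into (Path, JSON text) pairs. The names best_display_text and collect_turn are hypothetical, and picking the NBest entry with the highest Confidence is my own choice, not something the service mandates.

```python
import json


def best_display_text(phrase):
    """Return the Display text of the most confident NBest entry, or None."""
    if phrase.get("RecognitionStatus") != "Success":
        return None
    nbest = phrase.get("NBest") or []
    if not nbest:
        return None
    best = max(nbest, key=lambda entry: entry.get("Confidence", 0.0))
    return best.get("Display")


def collect_turn(messages):
    """Consume (path, json_text) pairs until turn.end; return final phrase texts."""
    results = []
    for path, json_text in messages:
        body = json.loads(json_text) if json_text.strip() else {}
        if path == "speech.phrase":
            text = best_display_text(body)
            if text is not None:
                results.append(text)
        elif path == "turn.end":
            break
        # turn.start, speech.startDetected, speech.hypothesis and
        # speech.endDetected are informational and ignored here.
    return results
```

Interim speech.hypothesis messages are skipped here; a live captioning client would probably display them and replace them once the matching speech.phrase arrives.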