Current speech models and processing are already fast enough, and good enough, to allow reasonable indexing of content, and who cares about false positives apart from the victims? They are already used in benign ways for indexing audio libraries. But false positives will be very common: the accuracy is still laughable when you actually read the output, and it degrades badly with noise and poor signal quality generally. There's a big difference between "decent" for indexing and decent for accurate, readable transcription.

Still, the assessment I think is about right: it requires roughly a 2GHz core dedicated to the task. The other problem is that less common languages don't have decent speech models. So it's not ready for mass transcription for some while yet. I suspect that reducing the power required for indexing-grade transcription will need more progress in parallel processing and neural-net hardware; getting real accuracy will need further advances in semantic understanding and a semblance of AI. Google's approach is to match the recognised audio against existing text on a likelihood basis, but it still leads to laughable guesses.

Interestingly, the laws covering lawful intercept of speech are much "better" than those covering mass data surveillance, to the extent that an individual warrant is needed. Of course, they'd probably argue some nonsense like they weren't really intercepting or "listening" because no people were involved, so mass surveillance was fine. Just be careful to enunciate clearly any words similar to those flagged for nefarious purposes if you don't want to end up on the wrong end of a false positive!
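To make the "match against existing text on a likelihood basis" idea concrete, here's a toy sketch of candidate rescoring: the acoustic model proposes confusable words with scores, and a bigram model built from a text corpus picks the sequence that reads most plausibly. This is a hypothetical illustration with made-up data, not Google's actual system, which is vastly more sophisticated (and it shows exactly where "laughable guesses" come from: the text model can only prefer what its corpus happens to contain).

```python
import math
from collections import Counter
from itertools import product

# Toy corpus standing in for the "existing text" a recogniser matches against.
corpus = "the cat sat on the mat the cat ran on the grass".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(prev, word, alpha=0.1):
    # Add-alpha smoothing so unseen word pairs keep a small nonzero likelihood.
    vocab = len(unigrams)
    return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab)

def rescore(candidate_lists, start="<s>"):
    """Pick the word sequence maximising acoustic score x text likelihood.

    candidate_lists: for each position, a list of (word, acoustic_score)
    pairs, i.e. the confusable alternatives the acoustic model emitted.
    """
    best, best_score = None, float("-inf")
    for path in product(*candidate_lists):
        score, prev = 0.0, start
        for word, acoustic in path:
            score += math.log(acoustic) + math.log(bigram_prob(prev, word))
            prev = word
        if score > best_score:
            best, best_score = [w for w, _ in path], score
    return best

# "sat" and "sad" sound alike and tie acoustically; the corpus breaks the tie.
hypotheses = [[("cat", 0.6), ("cap", 0.4)],
              [("sat", 0.5), ("sad", 0.5)]]
print(rescore(hypotheses))  # ['cat', 'sat']
```

The same mechanism is why a word that merely sounds like a flagged term can be "recognised" as that term with high confidence: the text model happily confirms whichever plausible sentence the corpus suggests.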