
Speech data mining for security applications (Dolování dat z řeči pro bezpečnostní aplikace)

Honza Černocký

BUT Speech@FIT, FIT VUT v Brně

Security Session, 11.4.2015

Security Session Honza Cernocky 11/4/2015 2/36

Agenda

• Introduction
• Gender ID example
• Speech recognition
• Language identification
• Speaker recognition
• Conclusions

3/36

Needle in a haystack

• Speech is the most important modality of human-human communication (~80% of information) … criminals and terrorists also communicate by speech.

• Speech is easy to acquire in the scenarios of interest.
• It is more difficult to find what we are looking for.
• Typically done by human experts, but always count on:
  • Limited personnel
  • Limited budget
  • Not enough languages spoken
  • Insufficient security clearances

Technologies of speech processing are not almighty but can help to narrow the search space.


Data mining from spontaneous, unprepared speech

From the audio (speech), several technologies answer different questions:
• Gender Recognition: What gender? (Male or Female)
• Language Recognition: What language? (English/German/??)
• Speaker/Voice Recognition: Who speaks? (John Doe)
• Speech Recognition: What was said? ("Hello John!"; "John" spotted)
• Time/relation analysis: Who asked whom? (John asked Paul)


How do we work?

• According to recipes from pattern-recognition textbooks!

A priori knowledge of the problem feeds a loop:
1. Collect data
2. Choose features
3. Choose model
4. Train model
5. Evaluate the classifier

Unhappy? Go back and revise. Happy (or deadline passed)? Deployment.
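The loop above can be sketched in a few lines of scikit-learn. Everything here is illustrative: the two synthetic classes stand in for labeled recordings, and the model choice is arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# 1) Collect data: two synthetic classes stand in for labeled recordings
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (200, 2))])
y = np.repeat([0, 1], 200)

# 2) Choose features (here the raw vectors) and 3) choose a model
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# 4) Train the model
model = LogisticRegression().fit(X_tr, y_tr)

# 5) Evaluate the classifier; if unhappy, go back and revise features/model
accuracy = model.score(X_te, y_te)
print(f"accuracy: {accuracy:.2f}")
```

If the evaluation disappoints, the recipe says to return to the feature or model choice, not to tweak the evaluation.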


The result

input → Feature extraction → Evaluation of probabilities or likelihoods (using Models) → "Decoding" → decision

7/36

The simplest example … GID

Gender Identification
• Tag speech segments as male or female.


So how is Gender-ID done?

input → MFCC features → evaluation of GMM likelihoods (Gaussian Mixture Models: boys, girls) → decision: male/female


Features – Mel Frequency Cepstral Coefficients

• The signal is not stationary

• And human hearing is not linear


Features – a vector each 10ms
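The two observations above (a non-stationary signal handled by short frames, non-linear hearing handled by a mel filterbank) lead to the classic MFCC recipe: one vector every 10 ms. A rough numpy-only sketch follows; the parameter values (25 ms window, 24 filters, 13 coefficients, 8 kHz) are common defaults, not values from the slides.

```python
import numpy as np

def mfcc(signal, fs=8000, n_filt=24, n_ceps=13):
    win, hop = int(0.025 * fs), int(0.010 * fs)          # 25 ms window, 10 ms hop
    frames = np.lib.stride_tricks.sliding_window_view(signal, win)[::hop]
    frames = frames * np.hamming(win)                     # taper each frame
    spec = np.abs(np.fft.rfft(frames, n=512)) ** 2        # power spectrum

    # Triangular mel filterbank: the mel scale models non-linear hearing
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    edges = imel(np.linspace(mel(0), mel(fs / 2), n_filt + 2))
    bins = np.floor((512 + 1) * edges / fs).astype(int)
    fbank = np.zeros((n_filt, 257))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    log_e = np.log(spec @ fbank.T + 1e-10)                # log filterbank energies
    n = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filt)
    return log_e @ dct.T                                  # one cepstral vector per 10 ms

frames = mfcc(np.random.randn(8000))                      # 1 s of noise at 8 kHz
print(frames.shape)                                       # (frames, coefficients)
```

Real front ends add pre-emphasis, liftering, and delta features on top of this skeleton.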


The evaluation of likelihoods: GMM


Decision: "decoding"

Gender ID summary

Needed data:
• Several hours of speech (from the target channels) labeled as M or F.

Accuracy:
• The most accurate of our speech data mining tools: >96% accuracy on challenging channels.

What do we get:
• Limiting the search space by 50%.
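The whole Gender-ID scheme fits in a few lines: train one GMM on "male" and one on "female" feature vectors, then tag a segment by the summed per-frame log-likelihoods. This is a minimal sketch with synthetic stand-ins for real MFCCs; the dimensions and component counts are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
dim = 13                                        # e.g. 13 MFCCs per 10 ms frame
male_train = rng.normal(-1.0, 1.0, (500, dim))  # synthetic "male" frames
female_train = rng.normal(1.0, 1.0, (500, dim)) # synthetic "female" frames

gmm_m = GaussianMixture(n_components=4, random_state=0).fit(male_train)
gmm_f = GaussianMixture(n_components=4, random_state=0).fit(female_train)

def tag(segment):
    """Sum per-frame log-likelihoods over the segment, pick the bigger one."""
    llk_m = gmm_m.score_samples(segment).sum()
    llk_f = gmm_f.score_samples(segment).sum()
    return "male" if llk_m > llk_f else "female"

test_segment = rng.normal(-1.0, 1.0, (100, dim))  # 1 s drawn from the "male" model
print(tag(test_segment))
```

Summing over all frames of a segment is what makes this so robust: individual frames are ambiguous, but a few seconds of evidence rarely are.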


Speech recognition

• Voice2text (V2T), Speech2text (S2T), transcription …
• Large vocabulary continuous speech recognition (LVCSR)

speech → Feature extraction → Evaluation of likelihoods (scores of hypotheses, using Acoustic models) → "Decoding" over a Recognition network (built from the Language model and the Pronunciation dictionary) → text

LVCSR technically …

• Acoustic models
  • … how well speech segments match basic speech units (phonemes)
  • trained on large (>100 h) quantities of carefully transcribed speech data
  • classically Gaussian Mixture models
• Language models
  • … how the words follow each other: President George Bush vs. President George push
  • need to be trained on large quantities (gigabytes) of text from the target domain
• Pronunciation dictionary
  • translates words into phonemes: dog → d oh g
  • the basis needs to be created by hand, the rest generated using a trained grapheme-to-phoneme (g2p) converter
• A toolkit to do all this … HTK, KALDI, proprietary.
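The language-model idea above ("how the words follow each other") can be shown with a toy bigram model: counts from a tiny made-up corpus make "President George Bush" more probable than the acoustically similar "President George push". The corpus and the add-one smoothing are illustrative, far simpler than production LMs.

```python
import math
from collections import Counter

# Tiny invented corpus; real LMs need gigabytes of target-domain text
corpus = ("president george bush spoke . president george bush said . "
          "do not push the door .").split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab = len(unigrams)

def logprob(sentence):
    """Add-one smoothed bigram log-probability of a word sequence."""
    words = sentence.lower().split()
    return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
               for a, b in zip(words, words[1:]))

print(logprob("President George Bush"))   # seen bigrams: higher score
print(logprob("President George push"))   # unseen bigram: lower score
```

During decoding this score is combined with the acoustic likelihood, which is how the recognizer prefers "Bush" even when the audio is ambiguous.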


Making LVCSR work well

• Neural networks
  • eating up other techniques (feature extraction, scoring, LM) - DNNs
  • bottleneck NNs
• Speaker adaptation
  • asking the speaker to read a text in dictation systems …
  • unsupervised adaptation needed!
  • MAP, MLLR, CMLLR, RDLT, SAT …


Challenges in LVCSR

• LVCSR is relatively mature in well-represented languages (US English, Modern Standard Arabic, Czech)
• Fast development of recognizers for new languages with limited resources: the IARPA BABEL project
  • limited language packs: 10 h transcribed + some 70 h of untranscribed data
  • 2013 languages: Cantonese, Turkish, Pashto, Tagalog; surprise: Vietnamese
  • 2014 languages: Bengali, Assamese, Zulu, Haitian Creole, Lao; surprise: Tamil
• How to re-use resources from other languages?
• How to adapt to a user's language/domain without seeing his/her data?


Some examples ….

and then they have one week to retrain their keyword results ... and ... give you might ask why one we there a lot of research or evaluation methods ... the people are trying out what keywords or so it is important to leave a ... sufficient amount of time there as well ...

uhuh kade sengifowunelwe nguThami manje ithi angazi e- ekhuluma nomunye ubhuti wakwamasipala ukuthi ene usho ukuthi kunabantu ekufanele baphelelwe ngumsebenzi ngoba uNomvula emecabanga uzokhokha (()) ngoba yena uzoy ithela uzoyi uzoyihlulisela ngoba phela kukhona aba- abaphethe u-Adam angithi

LVCSR – what to expect

Accuracies (word accuracy):
• Dictation: >90%
• Reasonably resourced languages: >70%
• Babel languages: ~70% WER (example on Tamil)

Is this OK?
• Usually not usable for direct reading, and it is questionable whether a trained secretary would not be faster when we need 100% accurate output.
• Yes, usable for search; for rare languages it is often the only alternative.
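The word accuracies and WER figures quoted above come from the standard word error rate: the Levenshtein (edit) distance between hypothesis and reference word sequences, divided by the reference length. A minimal implementation, with an invented example sentence:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / ref words."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[i + j if i * j == 0 else 0 for j in range(len(h) + 1)]
         for i in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # sub/del/ins
    return d[-1][-1] / len(r)

print(wer("president george bush spoke", "president george push spoke"))  # 0.25
```

Note WER can exceed 100% (insertions count), which is why "~70% WER" and ">70% accuracy" describe very different systems.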


LVCSR – user data

• Speech (for acoustic models):
  • many hours of data as close as possible to the target use (language, dialect, speaking style …)
  • needs to be transcribed better than in TV subtitles
• Text (for language models):
  • newspapers and TV news work for dictation, but not here
  • need target text data (including very dirty language)
  • can be simulated by looking for dirty Internet data (Twitter, discussion forums)
• Pronunciations: generally not a big deal, needs a list of words. Problematic for languages without expertise.
• Privacy issues:
  • speech and text are sensitive
  • re-training of LVCSR by the users has so far not been successful
  • work on modularization: collection of statistics by the user, shipping to development teams …
  • opportunity to collect this data jointly, especially for languages relevant for security across Europe


Language identification

• Which language is in the recording?

LID


Standard approaches

• Acoustics

• Phonotactics


LID: Current state-of-the-art system

• A large GMM (a "Universal Background Model", UBM) performs collection of sufficient statistics: a vector of several thousand parameters per utterance (fixed size!)
• Projection to a "language print": several hundred values.
• These language prints are scored, and the scores are calibrated.
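The shapes involved in the pipeline above can be sketched as follows. This is only a schematic: the "statistics" are synthetic, the projection is random rather than learned, and scoring is plain cosine similarity without calibration; the dimensions are stand-ins for "thousands" and "hundreds".

```python
import numpy as np

rng = np.random.default_rng(2)
STATS_DIM, PRINT_DIM = 4000, 400        # thousands -> hundreds of values

# A learned projection in real systems; random here, just to show the shapes
projection = rng.normal(size=(PRINT_DIM, STATS_DIM)) / np.sqrt(STATS_DIM)

def language_print(stats):
    """Project fixed-size utterance statistics to a unit-length language print."""
    p = projection @ stats
    return p / np.linalg.norm(p)

# Per-language models: a print per language (synthetic, separable by design)
lang_means = {"english": language_print(rng.normal(-1.0, 1.0, STATS_DIM)),
              "german": language_print(rng.normal(1.0, 1.0, STATS_DIM))}

def identify(stats):
    """Score a test print against every language model, return the best."""
    p = language_print(stats)
    scores = {lang: float(p @ m) for lang, m in lang_means.items()}
    return max(scores, key=scores.get), scores

lang, scores = identify(rng.normal(1.0, 1.0, STATS_DIM))  # "german-like" stats
print(lang, scores)
```

The key property shown here is the fixed size: however long the utterance, it becomes one short vector, so adding a language is just adding a model print.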

LID – what to expect

• Performance on nice data: NIST LRE 2009, 23 languages


[Chart: error rates (0-10%) of the five best systems on 30 s / 10 s / 3 s segments, and results across evaluation phases 1-3]

• And on terrible data RATS 2014, 5 languages (EER)


LID – user data

• Tens of hours of data per target language or dialect
• Only the language label is needed; no transcription necessary.
• Allows to:
  • improve the model of an existing language
  • add a new language or dialect, or even a target group
• LID is a technology where users can modify the system themselves.
• Language prints do not carry information on the content: potential for cooperation.
• Backup solution: automatic acquisition of language-specific telephone data from public sources (EOARD project).


Speaker recognition

Two hypotheses:
• H0: the speaker in the test recording IS THE SAME WE SAW IN THE ENROLLMENT
• H1: the speaker in the test recording IS DIFFERENT
• Decision by the log-likelihood ratio

SRE classical scheme

• Feature extraction: Mel Frequency Cepstral Coefficients
• Background model implemented as a Gaussian Mixture Model
• Adapted to the target speaker
• At test time, both models produce likelihoods that are subtracted and thresholded.

Such a system:
• can be built by a reasonably skilled student equipped with Matlab in half a day
• will function reasonably in case enrollment and test take place under similar conditions.
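The "half-a-day" system described above can be sketched in Python instead of Matlab: a background GMM, a target model trained on enrollment data, and a thresholded average log-likelihood ratio. All features here are synthetic stand-ins, and the target model is trained directly rather than MAP-adapted from the background model, which a real classical system would do.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
dim = 13                                          # e.g. MFCC dimension

background = rng.normal(0.0, 2.0, (2000, dim))    # many speakers pooled
enrollment = rng.normal(1.5, 1.0, (300, dim))     # the target speaker

ubm = GaussianMixture(n_components=8, random_state=0).fit(background)
target = GaussianMixture(n_components=8, random_state=0).fit(enrollment)

def llr(test_frames):
    """Average per-frame log-likelihood ratio: log p(X|H0) - log p(X|H1)."""
    return (target.score_samples(test_frames)
            - ubm.score_samples(test_frames)).mean()

same = llr(rng.normal(1.5, 1.0, (200, dim)))      # same "speaker": LLR > 0
diff = llr(rng.normal(-1.5, 1.0, (200, dim)))     # different "speaker": LLR < 0
print(same, diff)
```

With matched conditions this works; the following slides show why real inter-session variability breaks it.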


IKR !

Inter-session variability

NOT HAVING THE SAME CONDITIONS!

Intrinsic variability:
• Language
• Emotions, stress, Lombard effect
• Health condition
• Content of the message

Extrinsic variability:
• Noise
• Transmission channel
• Codec (or series of codecs)
• Recording device …


Years of SRE R&D fighting the variability …

Front-end processing → target model / background model (adapted, Σ, Λ) → LR score normalization

Feature domain:
• Noise removal
• Tone removal
• Cepstral mean subtraction
• RASTA filtering
• Mean & variance normalization
• Feature warping
• Feature mapping
• Eigenchannel adaptation in the feature domain

Model domain:
• Speaker Model Synthesis
• Eigenchannel compensation
• Joint Factor Analysis
• Nuisance Attribute Projection

Score domain:
• Z-norm
• T-norm
• ZT-norm


Current state-of-the-art

• Low-dimensional representation of whole recordings • i-Vectors (for R&D), Voiceprints (for business)

• Allows for very fast scoring.
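Why low-dimensional representations make scoring so fast can be shown with a sketch: once every recording is a short unit-length vector, comparing N enrolled prints against M test prints is a single matrix product. The sizes and the plain cosine scoring are illustrative assumptions (real systems score i-vectors with PLDA, mentioned later).

```python
import numpy as np

rng = np.random.default_rng(4)
DIM = 400                                    # e.g. a few hundred values per print

enrolled = rng.normal(size=(1000, DIM))      # 1000 enrolled speakers
tests = rng.normal(size=(50, DIM))           # 50 unknown recordings

# Normalize so a dot product is a cosine score
enrolled /= np.linalg.norm(enrolled, axis=1, keepdims=True)
tests /= np.linalg.norm(tests, axis=1, keepdims=True)

scores = tests @ enrolled.T                  # all 50 x 1000 scores at once
best = scores.argmax(axis=1)                 # closest enrolled speaker for each test
print(scores.shape)
```

Searching a test recording against a large watchlist thus costs one BLAS call, regardless of how long the original audio was.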


What to expect I.

• Works very nicely for long telephone recordings (EER ~2%) – multiple successes in NIST evaluations.

• Examples …


What to expect II.

• Noise, varying communication channels, short recordings (10s) still a problem – DARPA RATS program

• Examples …

SRE – user data

• The performance of the SRE system crucially depends on how close the training data is to the deployment conditions.
• UBM: needs lots (100s of hours) of unannotated data; not very sensitive.
• Voice-print extractor: ditto.
• Scoring done by PLDA:
  • voice-prints with speaker labels (A, B, C, …) needed
  • even 50 speakers help to increase the accuracy by 30%
  • … but some users are not able to collect/label even this amount
• Work running on unsupervised adaptation on unannotated data.


The charm of voice-prints

• Allowing for transfer of speaker identities
  • without giving out the original WAV
  • without the possibility to reconstruct what was said



• Opening a range of opportunities for:
  • cooperation between customers and law enforcement
  • cooperation with R&D teams

Conclusions

• Speech data mining technologies are already serving in security and defense (and you can test, and possibly buy, solutions from several vendors).

• International crime calls for an international reaction: standardization (even in the form of an informal working draft) should take place ASAP to allow police forces to exchange voice-prints regardless of vendor. … we're on it.


Thanks for the invitation to the Security Session!

Questions?

BACKUP SLIDES


Who am I
• M.S. in Radioelectronics from BUT, 1993
• Ph.D. in Signal Processing jointly from Université d'Orsay (France) and BUT
• Started with speech coding in 1992 and has stayed in speech processing since
• Was with the Oregon Graduate Institute (Portland, OR) in the group of Prof. Hermansky in 2001
• Since 2002 at the Faculty of Information Technology of BUT; habilitation to Associate Professor (Doc.) in 2003
• Executive leader of the BUT Speech@FIT research group
• Since 2008, Head of the Department of Computer Graphics and Multimedia


BUT Speech@FIT
• Founded in 1997 (1 person)
• ~20 people in 2013 (faculty, researchers, grad and pre-grad students, support staff)
• Active in all technologies this presentation is about
• Supported by EU, local, and US (DARPA and IARPA) grants

International cooperation and standardization

• NIST evaluation campaigns
  • allowing for objective comparison of technologies
  • often on too-good data
• US-funded projects
  • realistic testing on noisy channels (DARPA RATS) and new languages (IARPA Babel)
  • restricted to participants
• EU project examples
  • Past: MOBIO EU FP7 (mobile biometry) helped develop fast speaker recognition based on low-dimensional voice-prints.
  • SIIP, addressing topic SEC-2013.5.1-2 "Audio and voice analysis, speaker identification for security applications": an Integration Project, starting now.
• Standardization: not much …
  • UK Home Office Forensic Speech and Audio (FSA) Group: bringing forensic speech and audio under the regulation of ISO 17025
  • ANSI/NIST-ITL Standard 1-2013, Data Format for Interchange, Record Type-11: Forensic and investigatory voice record


