Date posted: 11-Apr-2017
Category: Technology
Upload: machine-learning-prague
Antispam sources
● Content
  ○ Text
  ○ Images
  ○ Attachments
  ○ Links
  ○ Headers
● Metadata
  ○ Traffic
  ○ Historic data (reputation)
  ○ Blacklists
  ○ Rules (DKIM, DMARC, SPF)
Grey email
Graymail is solicited bulk email that does not fit the definition of email spam (e.g., the recipient "opted into" receiving it). Recipient interest in this type of mailing tends to diminish over time, increasing the likelihood that recipients will report graymail as spam. In some cases, graymail can account for up to 82 percent of the average user's email inbox.
ML in antispam
● Topic
● Unsubscribe
● Phishing
● Domain keywords
● Images
● Personalized filter
● Link naturalness
Topic categorization
● 16 categories
● Manually labeled dataset
● 2 languages (2 models)
● 7th version
● Overlapping classes
NLP
● Bag of words
● Lemmatization
● Stop words
(1) John likes to watch movies. Mary likes movies too.
(2) John also likes to watch football games.
[ "John", "likes", "to", "watch", "movies", "also", "football", "games", "Mary", "too"]
(1) [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
(2) [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
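The bag-of-words example above can be reproduced with a few lines of Python. This is a minimal sketch, not the pipeline used in the talk: the names `tokenize`, `bag_of_words`, `VOCABULARY`, `DOC_1` and `DOC_2` are illustrative assumptions, the vocabulary is copied from the slide in the slide's order, and lemmatization and stop-word removal are left out.

```python
import re
from collections import Counter

# Vocabulary in the same order as on the slide.
VOCABULARY = ["John", "likes", "to", "watch", "movies",
              "also", "football", "games", "Mary", "too"]

DOC_1 = "John likes to watch movies. Mary likes movies too."
DOC_2 = "John also likes to watch football games."


def tokenize(text):
    """Split text into word tokens, dropping punctuation (assumed tokenizer)."""
    return re.findall(r"[A-Za-z]+", text)


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    # Counter returns 0 for words that do not occur in the text.
    return [counts[word] for word in vocabulary]


if __name__ == "__main__":
    print(bag_of_words(DOC_1, VOCABULARY))
    print(bag_of_words(DOC_2, VOCABULARY))
```

Word order is discarded: each document becomes a fixed-length count vector, which is the kind of input a linear classifier can work with.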
SVM
● Classification
● Best split for classes
● Linear classifier (kernels)
Multi-class version
● One vs. all
● Winner takes all
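The one-vs-all scheme with a winner-takes-all decision can be sketched as follows. This is a hedged illustration only: it assumes one already-trained linear scorer per class, and the weights, biases, feature meanings and the names `CLASSIFIERS`, `decision_score` and `predict_winner_takes_all` are hypothetical, not taken from the talk.

```python
def decision_score(weights, bias, features):
    """Linear decision function w . x + b of one binary (class vs. rest) scorer."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict_winner_takes_all(classifiers, features):
    """Score the input with every one-vs-all scorer and return the top class."""
    scores = {
        label: decision_score(weights, bias, features)
        for label, (weights, bias) in classifiers.items()
    }
    winner = max(scores, key=scores.get)
    return winner, scores


# Hypothetical weights over three made-up word-count features.
CLASSIFIERS = {
    "loan": ([1.0, -0.5, -0.5], -0.2),
    "pharmacy": ([-0.5, 1.0, -0.5], -0.2),
    "discount": ([-0.5, -0.5, 1.0], -0.2),
}

if __name__ == "__main__":
    label, scores = predict_winner_takes_all(CLASSIFIERS, [0, 2, 1])
    print(label, scores)
```

Each class gets its own binary classifier trained against all other classes, and the class whose scorer returns the highest value wins.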
Image categorization
● Classes: spam vs. ham
● Based on user reaction
● Links analysis
● Low-level image features
  ○ Size
  ○ DPI
  ○ Histograms
  ○ Exif
  ○ Compression
● Raw pixels
Image topics
● Caffe
● Pretrained network
● Same classes as for words
● Cleaned dataset of images from classified emails
● 400k images
● Slow on CPU
[Example images, labeled: Loan non-bank, Pharmacy, Discount, Ebola]
Image types
● Trivial
  ○ Animated
  ○ Monitoring
  ○ Border
● Photo
● Graphics
● Photo with graphics
Graphics
Photo
Image features
Extraction
● PIL
● OpenCV
● ImageMagick
Features (142)
● Channel stats
  ○ Min, max, mean
  ○ Standard deviation
  ○ Skewness
  ○ Entropy
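The channel statistics listed above can be computed as in the sketch below. It rests on assumptions: the talk extracts features with PIL, OpenCV and ImageMagick, while this example works on plain lists of 8-bit pixel values so that it runs without dependencies. The names `channel_stats` and `image_channel_stats` are hypothetical, the standard deviation is the population form, and entropy is the Shannon entropy of the value histogram in bits.

```python
import math
from collections import Counter


def channel_stats(values):
    """Return min, max, mean, std, skewness and entropy of one color channel."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n  # population variance
    std = math.sqrt(variance)
    if std > 0:
        # Third standardized moment; 0 for a symmetric distribution.
        skewness = sum((v - mean) ** 3 for v in values) / (n * std ** 3)
    else:
        skewness = 0.0  # constant channel: skewness is undefined, use 0
    # Shannon entropy (bits) of the histogram of pixel values.
    histogram = Counter(values)
    entropy = sum(-(c / n) * math.log2(c / n) for c in histogram.values())
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "std": std,
        "skewness": skewness,
        "entropy": entropy,
    }


def image_channel_stats(pixels):
    """Compute stats per channel for a list of (R, G, B) pixel tuples."""
    channels = zip(*pixels)  # regroup pixel tuples into per-channel sequences
    return [channel_stats(list(channel)) for channel in channels]


if __name__ == "__main__":
    toy_pixels = [(0, 10, 7), (0, 20, 7), (255, 30, 7), (255, 40, 7)]
    for name, stats in zip("RGB", image_channel_stats(toy_pixels)):
        print(name, stats)
```

A flat border or tracking pixel tends to have near-zero entropy and standard deviation, while a photo usually has high entropy, which is one reason such statistics can help separate the image types.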