Vít Listík - Email.cz workshop

Date post: 11-Apr-2017
Upload: machine-learning-prague
Email.cz workshop Vit Listik @tivwvit

Email stats
● 60M emails per day
● 3M users daily, 6M monthly
● 2 PB

Email delivery process

Antispam
● Fighting with the bad guys

Antispam sources
● Content
  ○ Text
  ○ Images
  ○ Attachments
  ○ Links
  ○ Headers
● Metadata
  ○ Traffic
  ○ Historic data (reputation)
  ○ Blacklists
  ○ Rules (DKIM, DMARC, SPF)

Grey email
Graymail is solicited bulk email messages that don't fit the definition of email spam (e.g., the recipient "opted into" receiving them). Recipient interest in this type of mailing tends to diminish over time, increasing the likelihood that recipients will report graymail as spam. In some cases, graymail can account for up to 82 percent of the average user's email inbox.

Antispam stats again

ML in antispam
● Topic
● Unsubscribe
● Phishing
● Domain keywords
● Images
● Personalized filter
● Link naturalness

Examples

https://github.com/tivvit/ML-Prague-2016-email-workshop

Tools
● Jupyter
  ○ Visualizations
  ○ State
● HDF5
● Pandas
● IPython cluster
● Cluster storage

Let's go (to) Jupyter

Topic categorization
● 16 categories
● Manually labeled dataset
● 2 languages (2 models)
● 7th version
● Overlapping classes

NLP
● Bag of words
● Lemmatization
● Stop words

(1) John likes to watch movies. Mary likes movies too.
(2) John also likes to watch football games.

["John", "likes", "to", "watch", "movies", "also", "football", "games", "Mary", "too"]

(1) [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
(2) [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
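The two count vectors above can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not the workshop's code: the tokenizer regex and the function names are assumptions, and the vocabulary is copied from the slide so that the column order matches.

```python
import re
from collections import Counter

# Vocabulary in the same order as on the slide.
VOCABULARY = ["John", "likes", "to", "watch", "movies",
              "also", "football", "games", "Mary", "too"]

DOC_1 = "John likes to watch movies. Mary likes movies too."
DOC_2 = "John also likes to watch football games."


def tokenize(text):
    """Split text into word tokens, dropping punctuation (assumed tokenizer)."""
    return re.findall(r"[A-Za-z]+", text)


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    # Counter returns 0 for words that never occur.
    return [counts[word] for word in vocabulary]


if __name__ == "__main__":
    print(bag_of_words(DOC_1, VOCABULARY))
    print(bag_of_words(DOC_2, VOCABULARY))
```

Word order is discarded on purpose: only the counts per vocabulary entry are kept, which is what makes the representation usable as a fixed-length input for a classifier.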

SVM
● Classification
● Best split for classes
● Linear classifier (kernels)

Multi-class version
● One vs. all
● Winner takes all
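The one-vs-all, winner-takes-all scheme can be sketched as follows: one linear binary classifier per class scores the input, and the class with the highest score wins. The weights, biases, feature words, and function names below are hypothetical values chosen for illustration; they are not trained and do not come from the workshop models.

```python
def decision_value(weights, bias, features):
    """Score of one linear binary classifier: w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict_one_vs_all(classifiers, features):
    """Score the input with every per-class classifier and pick the top one."""
    scores = {
        label: decision_value(weights, bias, features)
        for label, (weights, bias) in classifiers.items()
    }
    winner = max(scores, key=scores.get)  # winner takes all
    return winner, scores


# Hypothetical bag-of-words features: counts of ["loan", "pills", "sale"].
# Hypothetical (untrained) weights and bias for each "class vs. rest" model.
CLASSIFIERS = {
    "loan": ([1.0, -0.5, -0.5], -0.2),
    "pharmacy": ([-0.5, 1.0, -0.5], -0.2),
    "discount": ([-0.5, -0.5, 1.0], -0.2),
}

if __name__ == "__main__":
    label, scores = predict_one_vs_all(CLASSIFIERS, [0, 2, 0])
    print(label, scores)
```

With overlapping classes, as noted for the topic dataset, the runner-up scores in `scores` can be as informative as the winner, so the sketch returns them alongside the label.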

Topic categorization

Image categorization
● Classes: spam vs. ham
● Based on user reaction
● Links analysis
● Low level image features
  ○ Size
  ○ DPI
  ○ Histograms
  ○ EXIF
  ○ Compression
● Raw pixels

Spam roulette

User reactions
● Noisy
● Inconsistent
● Bots
● Low ratio

Image topics
● Caffe
● Pretrained network
● Same classes as for words
● Cleaned dataset of images from classified emails
● 400k images
● Slow on CPU

Example image labels: Loan non-bank, Pharmacy, Discount, Ebola

Distributed learning
● Spark
● SparkNet (Caffe)
● Elephas (Keras)

Image types
● Trivial
  ○ Animated
  ○ Monitoring
  ○ Border
● Photo
● Graphics
● Photo with graphics

Graphics

Photo

Image features
Extraction
● PIL
● OpenCV
● ImageMagick

Features (142)
● Channel stats
  ○ Min, max, mean
  ○ Standard deviation
  ○ Skewness
  ○ Entropy
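The channel statistics listed above can be computed per color channel roughly as follows. This is a sketch under assumptions: it takes a plain list of pixel intensities (which in practice would come from one of the extraction libraries above), uses the population standard deviation, and measures entropy in bits over the observed intensity histogram. The function name and the exact definitions are illustrative and may differ from those used for the 142 features.

```python
import math
from collections import Counter


def channel_stats(values):
    """Min, max, mean, std, skewness and entropy of one image channel."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance)
    # Skewness: third standardized moment; defined as 0 for a constant channel.
    if std > 0:
        skewness = sum((v - mean) ** 3 for v in values) / n / std ** 3
    else:
        skewness = 0.0
    # Shannon entropy (bits) of the intensity histogram.
    counts = Counter(values)
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "std": std,
        "skewness": skewness,
        "entropy": entropy,
    }


if __name__ == "__main__":
    # Hypothetical 2x2 single-channel image, flattened.
    print(channel_stats([0, 0, 255, 255]))
```

A flat, single-color channel yields zero standard deviation and zero entropy, which is one way features like these could separate trivial images (borders, monitoring pixels) from photos.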

Learning
● SciPy - Decision Trees
● Keras (TensorFlow, Theano)
● 30k manually labeled samples

Trees vs. neurons

Message
● Gray email
● Explore (visualize) your data (in Jupyter)
● Use libraries
● Simple subtasks (boosting) may help
● Store intermediate results
● Store test results with the model

