+ All Categories
Home > Technology > Tomáš Cícha - Machine Learning Solutions at

Tomáš Cícha - Machine Learning Solutions at

Date post: 22-Jan-2018
Category:
Upload: machine-learning-prague
View: 275 times
Download: 0 times
Share this document with a friend
68
Tomáš Cícha, Lukáš Vrábel, Vít Listík ML Prague April 24, 2016 Machine Learning Solutions at Seznam.cz
Transcript
Page 1: Tomáš Cícha - Machine Learning Solutions at

Tomáš Cícha, Lukáš Vrábel, Vít ListíkML Prague April 24, 2016

Machine Learning Solutionsat Seznam.cz

Page 2: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Seznam.cz

● Czech web portal with wide range of services

● Seznam = list, directory

● Founded in 1996

● More than 1100 employees

● Revenue ~ 100 million €

● 2.4 million people visit Seznam.cz every day

Page 3: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Seznam.cz Fulltext

● Web-scale search engine

● Since 2005

● 3 billion crawled web pages

● 600 million pages in search index

● 500 queries per second

● Strong local competitor for Google

● Not only fulltext – search in images & videos

Page 4: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Search engine architecture

The Internet

UserQuery

Page 5: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Search engine architecture

SearchIndex

The Internet

UserQuery

Page 6: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Search engine architecture

SearchIndex

The Internet

UserQuery

DocumentDatabase

Downloader

Indexer

Page 7: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Search engine architecture

SearchIndex

The Internet

UserQuery

DocumentDatabase

Downloader

Indexer

Searcher

QueryProcessor

Page 8: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

ML in Search engine architecture

SearchIndex

The Internet

UserQuery

DocumentDatabase

Downloader

Indexer

Searcher

QueryProcessor

Download the document?

Page 9: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

ML in Search engine architecture

SearchIndex

The Internet

UserQuery

DocumentDatabase

Downloader

Indexer

Searcher

QueryProcessor

Download the document?

Update the document?

Page 10: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

ML in Search engine architecture

SearchIndex

The Internet

UserQuery

DocumentDatabase

Downloader

Indexer

Searcher

QueryProcessor

Download the document?Update the document?

What are its features?

Page 11: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

ML in Search engine architecture

SearchIndex

The Internet

UserQuery

DocumentDatabase

Downloader

Indexer

Searcher

QueryProcessor

Download the document?Update the document?

What are its features?Which text is important?

Page 12: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

ML in Search engine architecture

SearchIndex

The Internet

UserQuery

DocumentDatabase

Downloader

Indexer

Searcher

QueryProcessor

Download the document?Update the document?

What are its features?Which text is important?

Should we index it?

Page 13: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

ML in Search engine architecture

SearchIndex

The Internet

UserQuery

DocumentDatabase

Downloader

Indexer

Searcher

QueryProcessor

Download the document?Update the document?

What are its features?Which text is important?

Should we index it?

Under which words?

Page 14: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

ML in Search engine architecture

SearchIndex

The Internet

UserQuery

DocumentDatabase

Downloader

Indexer

Searcher

QueryProcessor

Download the document?Update the document?

What are its features?Which text is important?

Should we index it?Under which words?

What to really search for?

Page 15: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

ML in Search engine architecture

SearchIndex

The Internet

UserQuery

DocumentDatabase

Downloader

Indexer

Searcher

QueryProcessor

Download the document?Update the document?

What are its features?Which text is important?

Should we index it?Under which words?

What to really search for?

What kind of results we want?

Page 16: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

ML in Search engine architecture

SearchIndex

The Internet

UserQuery

DocumentDatabase

Downloader

Indexer

Searcher

QueryProcessor

Download the document?Update the document?

What are its features?Which text is important?

Should we index it?Under which words?

What to really search for?Requirements for the result?

Which results are good?

Page 17: Tomáš Cícha - Machine Learning Solutions at

Tomáš CíchaML Prague April 24, 2016

rc-rank

Page 18: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

rc-rank

● Supervised ML algorithm● Researched & developed at Seznam.cz

since 2012● Used widely throughout our search engine

– Relevancy of search results

– Webpage classifications

– Query classifications

– Query typocorrection

Page 19: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

rc-rank

● Fast model evaluation● Many hyperparameters to fine-tune the

learning process● Adaptable to wide range of problems and

training datasets● Able to compete with state-of-the-art

– See Marek Modrý: Learning to Rank Algorithms(http://cyberold.felk.cvut.cz/research/theses/papers/471.pdf)

Page 20: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

The Big picture

Label data

Collect features

Train a model

Deploy in testenvironment

Deploy

Evaluate

Profit!

Page 21: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

The Big picture

Label data

Collect features

Train a model

Deploy in testenvironment

Deploy

Evaluate

Profit!

?Design features

Page 22: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

The Big picture

Label data

Collect features

Train a model

Deploy in testenvironment

Deploy

Evaluate

Profit!

?Design features

Modify rc-rank

Page 23: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Learning to Rank

● An Information Retrieval problem● Given a query...● … and a collection of possible answers● Sort the answers according to their

relevancy to the query● That effectively means – give them ranks

Page 24: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

The guessing game

Page 25: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

The rc-rank model

n conditions

2n results

Page 26: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

The rc-rank model

2n results

Object

n conditions

Page 27: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

The rc-rank model

?

2n results

Object

1

n conditions

Page 28: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

The rc-rank model

? ?

2n results

Object

1 0

n conditions

Page 29: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

The rc-rank model

? ? ? ?

2n results

Object

1 1 10 = 11

n conditions

Page 30: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

The rc-rank model

? ? ? ?

2n results

Object

1 1 10 = 11

n conditions

Page 31: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

The rc-rank model

x[2] < 12

n conditions

2n results

0.5206

2.6500

-5.9-1

6.661 1 10 = 11

x[0] < 0 x[5] < 2.4 x[7] < -0.5

0.5 2 2.6 -1 8.3 21 21 6.3 -23 0.6 51 42 41 35.2 28.1 0.02

Page 32: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

The rc-rank model

x[2] < 12

x[0] < 0

x[7] < -0.5

x[5] < 2.4

1

1

1

0

0.5 2 2.6 -1 8.3 21 21 6.3 -23 0.6 51 42 41 35.2 28.1 0.02

Page 33: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Oblivious decision tree

x[2] < 12

x[0] < 0

x[7] < -0.5

x[5] < 2.4

1

1

1

0

0.5 2 2.6 -1 8.3 21 21 6.3 -23 0.6 51 42 41 35.2 28.1 0.02

Page 34: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

The rc-rank model

● Ensemble of oblivious decision trees● Final result = sum of results of individual

trees● Many optimization options● Models are easy to combine

Page 35: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Training a tree

● Training examples = feature vectors

...... ...... ...... ...... ...... ...... ...... ...... linksCount

textRelevancy

documentAge

urlLength

queryInDomain...

Documents for query “ml prague”

Page 36: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Training a tree

● Training examples have assigned “labels”

...

1

...

3

...

1

...

4

...

5

...

2

...

1

...

3

...

3

...

2

...

3

...

2

...

4

...

1

...

2

...

1

Page 37: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Training a tree

? ?

13 14 52 13 32 32 41 21

Page 38: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Training a tree

? ?

13 14 52 13

32 32 41 21

LinksCount < 1

13 14 52 13 32 32 41 21

Page 39: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Training a tree

2.38 0

13 14 52 13

32 32 41 21

LinksCount < 1

13 14 52 13 32 32 41 21

average

Page 40: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Training a tree

? ?

14 51 31 31 21 23 34 22

13 14 52 13 32 32 41 21

DocumentAge < 14

Page 41: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Training a tree

2.38 2.38

14 51 31 31 21 23 34 22

13 14 52 13 32 32 41 21

DocumentAge < 14

Page 42: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Training a tree

2.14 4.0

13 21 35 21

32 13 21

44

TextRelevancy < 8.5

13 14 52 13 32 32 41 21

Page 43: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Training a tree

1.44 3.57

21 21 21 11 32 54 33 34

TextRelevancy < 10.2

13 14 52 13 32 32 41 21

Page 44: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Training a tree

1.44 3.57

TextRelevancy < 10.2

13 14 52 13 32 32 41 21

1.441.44 1.441.44 1.441.44 1.441.44 3.571.44 3.573.57 3.573.57 3.573.57

Variance = 1.113

rank

Page 45: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Training a tree

1.44 3.57

11 11 21 22 32 33 43 54

1.441.44 1.441.44 1.441.44 1.441.44 3.571.44 3.573.57 3.573.57 3.573.57

label-

rank=

error-0.4-0.4 -0.4-0.4 0.56-0.4 0.560.56 0.56 -0.6 -0.6-0.6 0.43-0.6 1.430.43

Mean Square Error = 0.372

Page 46: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Variance vs MSE

0

0,2

0,4

0,6

0,8

1

1,2

1,4

1,6

msevariance

Page 47: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Variance vs MSE

● Easier to compute● Naturally defines “strength” of a split● Has a “worst possible” value

Page 48: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Training a tree

11 11 21 22 32 33

4

3

54

1.0 2.0 3.0 4.33

11 11 1 2 22 32 33 3

4 54

TextRelevancy < 10.2

Condition?

Page 49: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Growing a tree

● Greedy algorithm – try all splits generated by available conditions

Page 50: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Growing a tree

● Greedy algorithm – try all splits generated by available conditions

● Find a sequence of splits which maximizes variance of the resulting rankings of training data

Page 51: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Growing a forest

● Train a sequence of trees● Supply new “labels” for each tree

Page 52: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Boosted regression

1

3

1

4

5

2

1

3

3

2

2

4

1

3

3

1

2

4

2

1

4

1

1

4

4

1

-2

0

0

2

1

0

0

-1

2

1

-2

0

0- =

Page 53: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Boosted regression

● Each tree is trained to estimate the remaining error of its predecessors

● Forest can be “cut” at any point● Very easy to trade evaluation speed for

model complexity● Overfitting is not an imminent problem

Page 54: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

The nonimportance of numbers

1

3

1

4

5

2

1

3

3

2

2

4

1

4

5

3

4

3

3

2

2

1

2

1

1

1

sort

4

5

3

4

3

3

2

2

1

2

1

1

1

Page 55: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

The nonimportance of numbers

1

3

1

4

5

2

1

3

3

2

2

4

1

4

5

3

4

3

3

2

2

1

2

1

1

1

sort

256

500

111

200

6

10

4

4

0

2

-6

-15

-7

Page 56: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

The nonimportance of numbers

1

3

1

4

5

2

1

3

3

2

2

4

1

4

5

3

4

3

3

2

2

1

2

1

1

1

sort

256

500

111

200

6

10

4

4

0

2

-6

-15

-7

Ordnung muss sein.

Page 57: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Lambda learning

● Inspired by LambdaMART (Microsoft Research)

● List-wise approach● Solves the Learning to Rank problem● Subsequent trees are trained to improve

the ordering of training examples● This is done per query!

Page 58: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Lambda learning

1

3

1

4

5

2

1

3

3

2

2

4

1

3

3

1

2

4

2

1

4

1

1

4

4

1

Page 59: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Lambda learning

3

2

5

4

3

1

4

2

1

1

1

3

2

4

4

4

4

3

3

2

2

1

1

1

1

1

Page 60: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Lambda learning

3

2

5

4

3

1

4

2

1

1

1

3

2

4

4

4

4

3

3

2

2

1

1

1

1

1

Swapping these twois a good idea +1

-1

The “lambdas”

Page 61: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Lambda learning

3

2

5

4

3

1

4

2

1

1

1

3

2

4

4

4

4

3

3

2

2

1

1

1

1

1

Even better+1

-1 -5

+5

Page 62: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Lambda learning

3

2

5

4

3

1

4

2

1

1

1

3

2

4

4

4

4

3

3

2

2

1

1

1

1

1

+1

-1 -5

+5Lets make sure theyare far enough from

each other

+2

-2

Page 63: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Lambda learning

3

2

5

4

3

1

4

2

1

1

1

3

2

4

4

4

4

3

3

2

2

1

1

1

1

1

+1

-1 -5

+5Please, switch

+2

-2

-3

+3

Page 64: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Lambda learning

3

2

5

4

3

1

4

2

1

1

1

3

2

4

4

4

4

3

3

2

2

1

1

1

1

1

+1

-1 -5

+5

I’m OK with that

+2

-2

-3

+3

+0

-0

Page 65: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Lambda learning

3

2

5

4

3

1

4

2

1

1

1

3

2

4

4

4

4

3

3

2

2

1

1

1

1

1

+1

-1 -5

+5

Also irrelevant

+2

-2

-3

+3

+0

-0

+0

+0

Page 66: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Lambda learning

● Compute lambdas for every pair● Sum them up● Use them as targets in the next tree● Recompute the ranks● Sort the examples● Repeat

Page 67: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz

Lambda learning

● Lambdas can be computed in a way that the algorithm will optimize given performance measure (ERR, nDCG...)

Page 68: Tomáš Cícha - Machine Learning Solutions at

www.seznam.cz


Recommended