Tomáš Cícha - Machine Learning Solutions at

Post on 22-Jan-2018

275 views 0 download

transcript

Tomáš Cícha, Lukáš Vrábel, Vít ListíkML Prague April 24, 2016

Machine Learning Solutionsat Seznam.cz

www.seznam.cz

Seznam.cz

● Czech web portal with wide range of services

● Seznam = list, directory

● Founded in 1996

● More than 1100 employees

● Revenue ~ 100 million €

● 2.4 million people visit Seznam.cz every day

www.seznam.cz

Seznam.cz Fulltext

● Web-scale search engine

● Since 2005

● 3 billion crawled web pages

● 600 million pages in search index

● 500 queries per second

● Strong local competitor for Google

● Not only fulltext – search in images & videos

www.seznam.cz

Search engine architecture

The Internet

UserQuery

www.seznam.cz

Search engine architecture

SearchIndex

The Internet

UserQuery

www.seznam.cz

Search engine architecture

SearchIndex

The Internet

UserQuery

DocumentDatabase

Downloader

Indexer

www.seznam.cz

Search engine architecture

SearchIndex

The Internet

UserQuery

DocumentDatabase

Downloader

Indexer

Searcher

QueryProcessor

www.seznam.cz

ML in Search engine architecture

SearchIndex

The Internet

UserQuery

DocumentDatabase

Downloader

Indexer

Searcher

QueryProcessor

Download the document?

www.seznam.cz

ML in Search engine architecture

SearchIndex

The Internet

UserQuery

DocumentDatabase

Downloader

Indexer

Searcher

QueryProcessor

Download the document?

Update the document?

www.seznam.cz

ML in Search engine architecture

SearchIndex

The Internet

UserQuery

DocumentDatabase

Downloader

Indexer

Searcher

QueryProcessor

Download the document?Update the document?

What are its features?

www.seznam.cz

ML in Search engine architecture

SearchIndex

The Internet

UserQuery

DocumentDatabase

Downloader

Indexer

Searcher

QueryProcessor

Download the document?Update the document?

What are its features?Which text is important?

www.seznam.cz

ML in Search engine architecture

SearchIndex

The Internet

UserQuery

DocumentDatabase

Downloader

Indexer

Searcher

QueryProcessor

Download the document?Update the document?

What are its features?Which text is important?

Should we index it?

www.seznam.cz

ML in Search engine architecture

SearchIndex

The Internet

UserQuery

DocumentDatabase

Downloader

Indexer

Searcher

QueryProcessor

Download the document?Update the document?

What are its features?Which text is important?

Should we index it?

Under which words?

www.seznam.cz

ML in Search engine architecture

SearchIndex

The Internet

UserQuery

DocumentDatabase

Downloader

Indexer

Searcher

QueryProcessor

Download the document?Update the document?

What are its features?Which text is important?

Should we index it?Under which words?

What to really search for?

www.seznam.cz

ML in Search engine architecture

SearchIndex

The Internet

UserQuery

DocumentDatabase

Downloader

Indexer

Searcher

QueryProcessor

Download the document?Update the document?

What are its features?Which text is important?

Should we index it?Under which words?

What to really search for?

What kind of results we want?

www.seznam.cz

ML in Search engine architecture

SearchIndex

The Internet

UserQuery

DocumentDatabase

Downloader

Indexer

Searcher

QueryProcessor

Download the document?Update the document?

What are its features?Which text is important?

Should we index it?Under which words?

What to really search for?Requirements for the result?

Which results are good?

Tomáš CíchaML Prague April 24, 2016

rc-rank

www.seznam.cz

rc-rank

● Supervised ML algorithm● Researched & developed at Seznam.cz

since 2012● Used widely throughout our search engine

– Relevancy of search results

– Webpage classifications

– Query classifications

– Query typocorrection

www.seznam.cz

rc-rank

● Fast model evaluation● Many hyperparameters to fine-tune the

learning process● Adaptable to wide range of problems and

training datasets● Able to compete with state-of-the-art

– See Marek Modrý: Learning to Rank Algorithms(http://cyberold.felk.cvut.cz/research/theses/papers/471.pdf)

www.seznam.cz

The Big picture

Label data

Collect features

Train a model

Deploy in testenvironment

Deploy

Evaluate

Profit!

www.seznam.cz

The Big picture

Label data

Collect features

Train a model

Deploy in testenvironment

Deploy

Evaluate

Profit!

?Design features

www.seznam.cz

The Big picture

Label data

Collect features

Train a model

Deploy in testenvironment

Deploy

Evaluate

Profit!

?Design features

Modify rc-rank

www.seznam.cz

Learning to Rank

● An Information Retrieval problem● Given a query...● … and a collection of possible answers● Sort the answers according to their

relevancy to the query● That effectively means – give them ranks

www.seznam.cz

The guessing game

www.seznam.cz

The rc-rank model

n conditions

2n results

www.seznam.cz

The rc-rank model

2n results

Object

n conditions

www.seznam.cz

The rc-rank model

?

2n results

Object

1

n conditions

www.seznam.cz

The rc-rank model

? ?

2n results

Object

1 0

n conditions

www.seznam.cz

The rc-rank model

? ? ? ?

2n results

Object

1 1 10 = 11

n conditions

www.seznam.cz

The rc-rank model

? ? ? ?

2n results

Object

1 1 10 = 11

n conditions

www.seznam.cz

The rc-rank model

x[2] < 12

n conditions

2n results

0.5206

2.6500

-5.9-1

6.661 1 10 = 11

x[0] < 0 x[5] < 2.4 x[7] < -0.5

0.5 2 2.6 -1 8.3 21 21 6.3 -23 0.6 51 42 41 35.2 28.1 0.02

www.seznam.cz

The rc-rank model

x[2] < 12

x[0] < 0

x[7] < -0.5

x[5] < 2.4

1

1

1

0

0.5 2 2.6 -1 8.3 21 21 6.3 -23 0.6 51 42 41 35.2 28.1 0.02

www.seznam.cz

Oblivious decision tree

x[2] < 12

x[0] < 0

x[7] < -0.5

x[5] < 2.4

1

1

1

0

0.5 2 2.6 -1 8.3 21 21 6.3 -23 0.6 51 42 41 35.2 28.1 0.02

www.seznam.cz

The rc-rank model

● Ensemble of oblivious decision trees● Final result = sum of results of individual

trees● Many optimization options● Models are easy to combine

www.seznam.cz

Training a tree

● Training examples = feature vectors

...... ...... ...... ...... ...... ...... ...... ...... linksCount

textRelevancy

documentAge

urlLength

queryInDomain...

Documents for query “ml prague”

www.seznam.cz

Training a tree

● Training examples have assigned “labels”

...

1

...

3

...

1

...

4

...

5

...

2

...

1

...

3

...

3

...

2

...

3

...

2

...

4

...

1

...

2

...

1

www.seznam.cz

Training a tree

? ?

13 14 52 13 32 32 41 21

www.seznam.cz

Training a tree

? ?

13 14 52 13

32 32 41 21

LinksCount < 1

13 14 52 13 32 32 41 21

www.seznam.cz

Training a tree

2.38 0

13 14 52 13

32 32 41 21

LinksCount < 1

13 14 52 13 32 32 41 21

average

www.seznam.cz

Training a tree

? ?

14 51 31 31 21 23 34 22

13 14 52 13 32 32 41 21

DocumentAge < 14

www.seznam.cz

Training a tree

2.38 2.38

14 51 31 31 21 23 34 22

13 14 52 13 32 32 41 21

DocumentAge < 14

www.seznam.cz

Training a tree

2.14 4.0

13 21 35 21

32 13 21

44

TextRelevancy < 8.5

13 14 52 13 32 32 41 21

www.seznam.cz

Training a tree

1.44 3.57

21 21 21 11 32 54 33 34

TextRelevancy < 10.2

13 14 52 13 32 32 41 21

www.seznam.cz

Training a tree

1.44 3.57

TextRelevancy < 10.2

13 14 52 13 32 32 41 21

1.441.44 1.441.44 1.441.44 1.441.44 3.571.44 3.573.57 3.573.57 3.573.57

Variance = 1.113

rank

www.seznam.cz

Training a tree

1.44 3.57

11 11 21 22 32 33 43 54

1.441.44 1.441.44 1.441.44 1.441.44 3.571.44 3.573.57 3.573.57 3.573.57

label-

rank=

error-0.4-0.4 -0.4-0.4 0.56-0.4 0.560.56 0.56 -0.6 -0.6-0.6 0.43-0.6 1.430.43

Mean Square Error = 0.372

www.seznam.cz

Variance vs MSE

0

0,2

0,4

0,6

0,8

1

1,2

1,4

1,6

msevariance

www.seznam.cz

Variance vs MSE

● Easier to compute● Naturally defines “strength” of a split● Has a “worst possible” value

www.seznam.cz

Training a tree

11 11 21 22 32 33

4

3

54

1.0 2.0 3.0 4.33

11 11 1 2 22 32 33 3

4 54

TextRelevancy < 10.2

Condition?

www.seznam.cz

Growing a tree

● Greedy algorithm – try all splits generated by available conditions

www.seznam.cz

Growing a tree

● Greedy algorithm – try all splits generated by available conditions

● Find a sequence of splits which maximizes variance of the resulting rankings of training data

www.seznam.cz

Growing a forest

● Train a sequence of trees● Supply new “labels” for each tree

www.seznam.cz

Boosted regression

1

3

1

4

5

2

1

3

3

2

2

4

1

3

3

1

2

4

2

1

4

1

1

4

4

1

-2

0

0

2

1

0

0

-1

2

1

-2

0

0- =

www.seznam.cz

Boosted regression

● Each tree is trained to estimate the remaining error of its predecessors

● Forest can be “cut” at any point● Very easy to trade evaluation speed for

model complexity● Overfitting is not an imminent problem

www.seznam.cz

The nonimportance of numbers

1

3

1

4

5

2

1

3

3

2

2

4

1

4

5

3

4

3

3

2

2

1

2

1

1

1

sort

4

5

3

4

3

3

2

2

1

2

1

1

1

www.seznam.cz

The nonimportance of numbers

1

3

1

4

5

2

1

3

3

2

2

4

1

4

5

3

4

3

3

2

2

1

2

1

1

1

sort

256

500

111

200

6

10

4

4

0

2

-6

-15

-7

www.seznam.cz

The nonimportance of numbers

1

3

1

4

5

2

1

3

3

2

2

4

1

4

5

3

4

3

3

2

2

1

2

1

1

1

sort

256

500

111

200

6

10

4

4

0

2

-6

-15

-7

Ordnung muss sein.

www.seznam.cz

Lambda learning

● Inspired by LambdaMART (Microsoft Research)

● List-wise approach● Solves the Learning to Rank problem● Subsequent trees are trained to improve

the ordering of training examples● This is done per query!

www.seznam.cz

Lambda learning

1

3

1

4

5

2

1

3

3

2

2

4

1

3

3

1

2

4

2

1

4

1

1

4

4

1

www.seznam.cz

Lambda learning

3

2

5

4

3

1

4

2

1

1

1

3

2

4

4

4

4

3

3

2

2

1

1

1

1

1

www.seznam.cz

Lambda learning

3

2

5

4

3

1

4

2

1

1

1

3

2

4

4

4

4

3

3

2

2

1

1

1

1

1

Swapping these twois a good idea +1

-1

The “lambdas”

www.seznam.cz

Lambda learning

3

2

5

4

3

1

4

2

1

1

1

3

2

4

4

4

4

3

3

2

2

1

1

1

1

1

Even better+1

-1 -5

+5

www.seznam.cz

Lambda learning

3

2

5

4

3

1

4

2

1

1

1

3

2

4

4

4

4

3

3

2

2

1

1

1

1

1

+1

-1 -5

+5Lets make sure theyare far enough from

each other

+2

-2

www.seznam.cz

Lambda learning

3

2

5

4

3

1

4

2

1

1

1

3

2

4

4

4

4

3

3

2

2

1

1

1

1

1

+1

-1 -5

+5Please, switch

+2

-2

-3

+3

www.seznam.cz

Lambda learning

3

2

5

4

3

1

4

2

1

1

1

3

2

4

4

4

4

3

3

2

2

1

1

1

1

1

+1

-1 -5

+5

I’m OK with that

+2

-2

-3

+3

+0

-0

www.seznam.cz

Lambda learning

3

2

5

4

3

1

4

2

1

1

1

3

2

4

4

4

4

3

3

2

2

1

1

1

1

1

+1

-1 -5

+5

Also irrelevant

+2

-2

-3

+3

+0

-0

+0

+0

www.seznam.cz

Lambda learning

● Compute lambdas for every pair● Sum them up● Use them as targets in the next tree● Recompute the ranks● Sort the examples● Repeat

www.seznam.cz

Lambda learning

● Lambdas can be computed in a way that the algorithm will optimize given performance measure (ERR, nDCG...)

www.seznam.cz