Date post: | 22-Jan-2018 |
Category: |
Technology |
Upload: | machine-learning-prague |
View: | 275 times |
Download: | 0 times |
Tomáš Cícha, Lukáš Vrábel, Vít ListíkML Prague April 24, 2016
Machine Learning Solutionsat Seznam.cz
www.seznam.cz
Seznam.cz
● Czech web portal with wide range of services
● Seznam = list, directory
● Founded in 1996
● More than 1100 employees
● Revenue ~ 100 million €
● 2.4 million people visit Seznam.cz every day
www.seznam.cz
Seznam.cz Fulltext
● Web-scale search engine
● Since 2005
● 3 billion crawled web pages
● 600 million pages in search index
● 500 queries per second
● Strong local competitor for Google
● Not only fulltext – search in images & videos
www.seznam.cz
Search engine architecture
The Internet
UserQuery
www.seznam.cz
Search engine architecture
SearchIndex
The Internet
UserQuery
www.seznam.cz
Search engine architecture
SearchIndex
The Internet
UserQuery
DocumentDatabase
Downloader
Indexer
www.seznam.cz
Search engine architecture
SearchIndex
The Internet
UserQuery
DocumentDatabase
Downloader
Indexer
Searcher
QueryProcessor
www.seznam.cz
ML in Search engine architecture
SearchIndex
The Internet
UserQuery
DocumentDatabase
Downloader
Indexer
Searcher
QueryProcessor
Download the document?
www.seznam.cz
ML in Search engine architecture
SearchIndex
The Internet
UserQuery
DocumentDatabase
Downloader
Indexer
Searcher
QueryProcessor
Download the document?
Update the document?
www.seznam.cz
ML in Search engine architecture
SearchIndex
The Internet
UserQuery
DocumentDatabase
Downloader
Indexer
Searcher
QueryProcessor
Download the document?Update the document?
What are its features?
www.seznam.cz
ML in Search engine architecture
SearchIndex
The Internet
UserQuery
DocumentDatabase
Downloader
Indexer
Searcher
QueryProcessor
Download the document?Update the document?
What are its features?Which text is important?
www.seznam.cz
ML in Search engine architecture
SearchIndex
The Internet
UserQuery
DocumentDatabase
Downloader
Indexer
Searcher
QueryProcessor
Download the document?Update the document?
What are its features?Which text is important?
Should we index it?
www.seznam.cz
ML in Search engine architecture
SearchIndex
The Internet
UserQuery
DocumentDatabase
Downloader
Indexer
Searcher
QueryProcessor
Download the document?Update the document?
What are its features?Which text is important?
Should we index it?
Under which words?
www.seznam.cz
ML in Search engine architecture
SearchIndex
The Internet
UserQuery
DocumentDatabase
Downloader
Indexer
Searcher
QueryProcessor
Download the document?Update the document?
What are its features?Which text is important?
Should we index it?Under which words?
What to really search for?
www.seznam.cz
ML in Search engine architecture
SearchIndex
The Internet
UserQuery
DocumentDatabase
Downloader
Indexer
Searcher
QueryProcessor
Download the document?Update the document?
What are its features?Which text is important?
Should we index it?Under which words?
What to really search for?
What kind of results we want?
www.seznam.cz
ML in Search engine architecture
SearchIndex
The Internet
UserQuery
DocumentDatabase
Downloader
Indexer
Searcher
QueryProcessor
Download the document?Update the document?
What are its features?Which text is important?
Should we index it?Under which words?
What to really search for?Requirements for the result?
Which results are good?
Tomáš CíchaML Prague April 24, 2016
rc-rank
www.seznam.cz
rc-rank
● Supervised ML algorithm● Researched & developed at Seznam.cz
since 2012● Used widely throughout our search engine
– Relevancy of search results
– Webpage classifications
– Query classifications
– Query typocorrection
www.seznam.cz
rc-rank
● Fast model evaluation● Many hyperparameters to fine-tune the
learning process● Adaptable to wide range of problems and
training datasets● Able to compete with state-of-the-art
– See Marek Modrý: Learning to Rank Algorithms(http://cyberold.felk.cvut.cz/research/theses/papers/471.pdf)
www.seznam.cz
The Big picture
Label data
Collect features
Train a model
Deploy in testenvironment
Deploy
Evaluate
Profit!
www.seznam.cz
The Big picture
Label data
Collect features
Train a model
Deploy in testenvironment
Deploy
Evaluate
Profit!
?Design features
www.seznam.cz
The Big picture
Label data
Collect features
Train a model
Deploy in testenvironment
Deploy
Evaluate
Profit!
?Design features
Modify rc-rank
www.seznam.cz
Learning to Rank
● An Information Retrieval problem● Given a query...● … and a collection of possible answers● Sort the answers according to their
relevancy to the query● That effectively means – give them ranks
www.seznam.cz
The guessing game
www.seznam.cz
The rc-rank model
n conditions
2n results
www.seznam.cz
The rc-rank model
2n results
Object
n conditions
www.seznam.cz
The rc-rank model
?
2n results
Object
1
n conditions
www.seznam.cz
The rc-rank model
? ?
2n results
Object
1 0
n conditions
www.seznam.cz
The rc-rank model
? ? ? ?
2n results
Object
1 1 10 = 11
n conditions
www.seznam.cz
The rc-rank model
? ? ? ?
2n results
Object
1 1 10 = 11
n conditions
www.seznam.cz
The rc-rank model
x[2] < 12
n conditions
2n results
0.5206
2.6500
-5.9-1
6.661 1 10 = 11
x[0] < 0 x[5] < 2.4 x[7] < -0.5
0.5 2 2.6 -1 8.3 21 21 6.3 -23 0.6 51 42 41 35.2 28.1 0.02
www.seznam.cz
The rc-rank model
x[2] < 12
x[0] < 0
x[7] < -0.5
x[5] < 2.4
1
1
1
0
0.5 2 2.6 -1 8.3 21 21 6.3 -23 0.6 51 42 41 35.2 28.1 0.02
www.seznam.cz
Oblivious decision tree
x[2] < 12
x[0] < 0
x[7] < -0.5
x[5] < 2.4
1
1
1
0
0.5 2 2.6 -1 8.3 21 21 6.3 -23 0.6 51 42 41 35.2 28.1 0.02
www.seznam.cz
The rc-rank model
● Ensemble of oblivious decision trees● Final result = sum of results of individual
trees● Many optimization options● Models are easy to combine
www.seznam.cz
Training a tree
● Training examples = feature vectors
...... ...... ...... ...... ...... ...... ...... ...... linksCount
textRelevancy
documentAge
urlLength
queryInDomain...
Documents for query “ml prague”
www.seznam.cz
Training a tree
● Training examples have assigned “labels”
...
1
...
3
...
1
...
4
...
5
...
2
...
1
...
3
...
3
...
2
...
3
...
2
...
4
...
1
...
2
...
1
www.seznam.cz
Training a tree
? ?
13 14 52 13 32 32 41 21
www.seznam.cz
Training a tree
? ?
13 14 52 13
32 32 41 21
LinksCount < 1
13 14 52 13 32 32 41 21
www.seznam.cz
Training a tree
2.38 0
13 14 52 13
32 32 41 21
LinksCount < 1
13 14 52 13 32 32 41 21
average
www.seznam.cz
Training a tree
? ?
14 51 31 31 21 23 34 22
13 14 52 13 32 32 41 21
DocumentAge < 14
www.seznam.cz
Training a tree
2.38 2.38
14 51 31 31 21 23 34 22
13 14 52 13 32 32 41 21
DocumentAge < 14
www.seznam.cz
Training a tree
2.14 4.0
13 21 35 21
32 13 21
44
TextRelevancy < 8.5
13 14 52 13 32 32 41 21
www.seznam.cz
Training a tree
1.44 3.57
21 21 21 11 32 54 33 34
TextRelevancy < 10.2
13 14 52 13 32 32 41 21
www.seznam.cz
Training a tree
1.44 3.57
TextRelevancy < 10.2
13 14 52 13 32 32 41 21
1.441.44 1.441.44 1.441.44 1.441.44 3.571.44 3.573.57 3.573.57 3.573.57
Variance = 1.113
rank
www.seznam.cz
Training a tree
1.44 3.57
11 11 21 22 32 33 43 54
1.441.44 1.441.44 1.441.44 1.441.44 3.571.44 3.573.57 3.573.57 3.573.57
label-
rank=
error-0.4-0.4 -0.4-0.4 0.56-0.4 0.560.56 0.56 -0.6 -0.6-0.6 0.43-0.6 1.430.43
Mean Square Error = 0.372
www.seznam.cz
Variance vs MSE
0
0,2
0,4
0,6
0,8
1
1,2
1,4
1,6
msevariance
www.seznam.cz
Variance vs MSE
● Easier to compute● Naturally defines “strength” of a split● Has a “worst possible” value
www.seznam.cz
Training a tree
11 11 21 22 32 33
4
3
54
1.0 2.0 3.0 4.33
11 11 1 2 22 32 33 3
4 54
TextRelevancy < 10.2
Condition?
www.seznam.cz
Growing a tree
● Greedy algorithm – try all splits generated by available conditions
www.seznam.cz
Growing a tree
● Greedy algorithm – try all splits generated by available conditions
● Find a sequence of splits which maximizes variance of the resulting rankings of training data
www.seznam.cz
Growing a forest
● Train a sequence of trees● Supply new “labels” for each tree
www.seznam.cz
Boosted regression
1
3
1
4
5
2
1
3
3
2
2
4
1
3
3
1
2
4
2
1
4
1
1
4
4
1
-2
0
0
2
1
0
0
-1
2
1
-2
0
0- =
www.seznam.cz
Boosted regression
● Each tree is trained to estimate the remaining error of its predecessors
● Forest can be “cut” at any point● Very easy to trade evaluation speed for
model complexity● Overfitting is not an imminent problem
www.seznam.cz
The nonimportance of numbers
1
3
1
4
5
2
1
3
3
2
2
4
1
4
5
3
4
3
3
2
2
1
2
1
1
1
sort
4
5
3
4
3
3
2
2
1
2
1
1
1
www.seznam.cz
The nonimportance of numbers
1
3
1
4
5
2
1
3
3
2
2
4
1
4
5
3
4
3
3
2
2
1
2
1
1
1
sort
256
500
111
200
6
10
4
4
0
2
-6
-15
-7
www.seznam.cz
The nonimportance of numbers
1
3
1
4
5
2
1
3
3
2
2
4
1
4
5
3
4
3
3
2
2
1
2
1
1
1
sort
256
500
111
200
6
10
4
4
0
2
-6
-15
-7
Ordnung muss sein.
www.seznam.cz
Lambda learning
● Inspired by LambdaMART (Microsoft Research)
● List-wise approach● Solves the Learning to Rank problem● Subsequent trees are trained to improve
the ordering of training examples● This is done per query!
www.seznam.cz
Lambda learning
1
3
1
4
5
2
1
3
3
2
2
4
1
3
3
1
2
4
2
1
4
1
1
4
4
1
www.seznam.cz
Lambda learning
3
2
5
4
3
1
4
2
1
1
1
3
2
4
4
4
4
3
3
2
2
1
1
1
1
1
www.seznam.cz
Lambda learning
3
2
5
4
3
1
4
2
1
1
1
3
2
4
4
4
4
3
3
2
2
1
1
1
1
1
Swapping these twois a good idea +1
-1
The “lambdas”
www.seznam.cz
Lambda learning
3
2
5
4
3
1
4
2
1
1
1
3
2
4
4
4
4
3
3
2
2
1
1
1
1
1
Even better+1
-1 -5
+5
www.seznam.cz
Lambda learning
3
2
5
4
3
1
4
2
1
1
1
3
2
4
4
4
4
3
3
2
2
1
1
1
1
1
+1
-1 -5
+5Lets make sure theyare far enough from
each other
+2
-2
www.seznam.cz
Lambda learning
3
2
5
4
3
1
4
2
1
1
1
3
2
4
4
4
4
3
3
2
2
1
1
1
1
1
+1
-1 -5
+5Please, switch
+2
-2
-3
+3
www.seznam.cz
Lambda learning
3
2
5
4
3
1
4
2
1
1
1
3
2
4
4
4
4
3
3
2
2
1
1
1
1
1
+1
-1 -5
+5
I’m OK with that
+2
-2
-3
+3
+0
-0
www.seznam.cz
Lambda learning
3
2
5
4
3
1
4
2
1
1
1
3
2
4
4
4
4
3
3
2
2
1
1
1
1
1
+1
-1 -5
+5
Also irrelevant
+2
-2
-3
+3
+0
-0
+0
+0
www.seznam.cz
Lambda learning
● Compute lambdas for every pair● Sum them up● Use them as targets in the next tree● Recompute the ranks● Sort the examples● Repeat
www.seznam.cz
Lambda learning
● Lambdas can be computed in a way that the algorithm will optimize given performance measure (ERR, nDCG...)
www.seznam.cz