Date post: | 14-Feb-2017 |
Category: |
Internet |
Upload: | webarchive-of-national-library-of-the-czech-republic |
Webarchiv.cz: Addendum to the lecture on running the memorial of the Czech web.
2266 domains
Docker?
Monitrix
https://github.com/ukwa/monitrix
Prototype 1
Monitoring / front end for Heritrix 3
Analytics for the harvest in progress / probably aggregates data from only one machine
Prototype 2
ELK: Elasticsearch / Logstash / Kibana
25 million log lines / 26 GB on disk / 4 vCPU / 20 GB RAM. Open question: how to scale this to broad, domain-wide harvests
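To make the scaling question concrete, here is a rough back-of-the-envelope sketch. It uses only the two figures from the slide (25 million log lines, 26 GB on disk) and assumes, as a simplification, that disk usage grows linearly with the number of log lines. The figure of one billion lines for a broad harvest is purely hypothetical, and 26 GB is read as decimal gigabytes.

```python
# Figures quoted on the slide for the ELK prototype.
LOG_LINES = 25_000_000
DISK_BYTES = 26 * 10**9  # 26 GB, read as decimal gigabytes (assumption)


def bytes_per_line(disk_bytes: int, lines: int) -> float:
    """Average on-disk footprint of one indexed log line."""
    return disk_bytes / lines


def estimate_disk_gb(lines: int, per_line: float) -> float:
    """Linear extrapolation of disk usage; real index growth may differ."""
    return lines * per_line / 10**9


per_line = bytes_per_line(DISK_BYTES, LOG_LINES)

# Hypothetical size of a broad harvest log; not a number from the slides.
HYPOTHETICAL_LINES = 1_000_000_000

print(f"{per_line:.0f} bytes per indexed log line")
print(f"{estimate_disk_gb(HYPOTHETICAL_LINES, per_line):.0f} GB "
      f"for {HYPOTHETICAL_LINES:,} lines")
```

The estimate covers disk only; CPU and RAM requirements are not derived here.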
QA
a process for analyzing the report of websites that were not harvested and for harvesting them again
a process for analyzing URLs that were discovered but not harvested
a process for checking harvests of special websites such as YouTube, Facebook, Twitter
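The analysis of discovered but unharvested URLs could be sketched roughly as follows. This is only an illustration, not the actual Webarchiv.cz process: it assumes both URL lists are already available as plain Python lists, and the function names and example URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def unharvested(discovered, harvested):
    """Return URLs that were discovered but never harvested."""
    return set(discovered) - set(harvested)


def count_by_host(urls):
    """Count URLs per hostname so the worst-affected sites stand out."""
    return Counter(urlsplit(url).hostname for url in urls)


# Hypothetical example data.
discovered = [
    "http://example.cz/",
    "http://example.cz/a.html",
    "http://example.cz/b.html",
    "http://example.org/video",
]
harvested = ["http://example.cz/"]

missing = unharvested(discovered, harvested)
for host, count in count_by_host(missing).most_common():
    print(host, count)
```

Grouping by host gives a short list of candidate sites for a repeat harvest.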
Webarchiv.cz: Where to head next?
Services
CDX SERVER API
http://web.archive.org/cdx/search/cdx?url=archive.org&output=json&limit=2&filter=!statuscode:200 will return 2 capture results with non-200 status codes.
http://web.archive.org/cdx/search/cdx?url=archive.org&output=json&limit=10&filter=!statuscode:200&filter=!mimetype:text/html&filter=digest:2WAXX5NUWNNCS2BDKCO5OVDQBJVNKIVV will return 10 capture results with non-200 status codes and MIME types that are not text/html, but which match a specific content digest.
https://github.com/iipc/openwayback/tree/master/wayback-cdx-server-webapp
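The two query URLs above can be assembled programmatically. The following is a minimal sketch using only the Python standard library; the helper name build_cdx_query is hypothetical, and the parameters (url, output, limit, repeated filter) are the ones that appear in the example URLs. It only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URLs on the slide.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, limit, filters=(), output="json"):
    """Build a CDX Server query URL; each filter becomes its own parameter."""
    params = [("url", url), ("output", output), ("limit", str(limit))]
    params.extend(("filter", f) for f in filters)
    # Keep "!", ":" and "/" unescaped so the result matches the slide's URLs.
    return CDX_ENDPOINT + "?" + urlencode(params, safe="!:/")


non_200 = build_cdx_query("archive.org", 2, ["!statuscode:200"])
by_digest = build_cdx_query(
    "archive.org",
    10,
    [
        "!statuscode:200",
        "!mimetype:text/html",
        "digest:2WAXX5NUWNNCS2BDKCO5OVDQBJVNKIVV",
    ],
)
print(non_200)
print(by_digest)
```

A leading "!" negates a filter, so "!statuscode:200" selects captures whose status code is not 200.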
WAT
>> data['Envelope']['WARC-Header-Metadata']['WARC-Type']
"response"
>> data['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']['Headers']['Server']
"Apache"
>> data['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']['HTML-Metadata']['Head']['Title']
"BBC NEWS | Africa | Namibia braces for Nujoma exit"
>> len(data['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']['HTML-Metadata']['Links'])
42
>> data['Envelope']['Payload-Metadata']['HTTP-Response-Metadata']['HTML-Metadata']['Links'][28]
{"path": "A@/href", "title": "Home of BBC Sport on the internet", "url": "http://news.bbc.co.uk/sport1/hi/default.stm"}
Usage: https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Metadata+File+Specification
WAT specification: https://webarchive.jira.com/wiki/display/Iresearch/Web+Archive+Transformation+(WAT)+Specification,+Utilities,+and+Usage+Overview
Workshop on building a graph from WAT data: https://home.archive.org/~vinay/archive-web-graphs-workshop/
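As a small illustration of how link data for a web graph might be pulled out of a WAT record, the sketch below walks the same key path as the sample session above. The record is a hand-made, heavily trimmed example: the first link comes from the slide, the second one (an image) is hypothetical, and the helper names are illustrative only.

```python
import json

# Trimmed, partly hypothetical WAT envelope following the key path on the slide.
SAMPLE_WAT_JSON = """
{"Envelope": {
  "WARC-Header-Metadata": {"WARC-Type": "response"},
  "Payload-Metadata": {"HTTP-Response-Metadata": {
    "Headers": {"Server": "Apache"},
    "HTML-Metadata": {
      "Head": {"Title": "BBC NEWS | Africa | Namibia braces for Nujoma exit"},
      "Links": [
        {"path": "A@/href",
         "title": "Home of BBC Sport on the internet",
         "url": "http://news.bbc.co.uk/sport1/hi/default.stm"},
        {"path": "IMG@/src", "url": "http://example.org/logo.gif"}
      ]}}}}}
"""


def html_metadata(record):
    """Return the HTML-Metadata block, or an empty dict if it is absent."""
    payload = record.get("Envelope", {}).get("Payload-Metadata", {})
    return payload.get("HTTP-Response-Metadata", {}).get("HTML-Metadata", {})


def anchor_targets(record):
    """Return target URLs of anchor links (entries whose path starts with 'A@')."""
    links = html_metadata(record).get("Links", [])
    return [l["url"] for l in links
            if l.get("path", "").startswith("A@") and "url" in l]


data = json.loads(SAMPLE_WAT_JSON)
print(html_metadata(data)["Head"]["Title"])
print(anchor_targets(data))
```

Pairing each record's own URL with its anchor targets would give the edges of a link graph; the record URL field is omitted from this trimmed sample.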
Common Crawl
Amazon infrastructure can be used to run analytics over Common Crawl data
monthly growth of more than ~100 TB
Common Crawl: https://commoncrawl.org/the-data/get-started/
Examples of using Common Crawl data: http://commoncrawl.org/the-data/examples/
CDX Server API with a GUI for browsing CDX files: http://index.commoncrawl.org
Full-text search
Portuguese full-text search prototype
http://www.arquivo.pt/resawdev
The login is: resaw/resaw.eu
https://sobre.arquivo.pt/news/a-first-attempt-to-archive-the-.eu-domain?set_language=en
https://netpreserveblog.wordpress.com/2015/06/03/a-first-attempt-to-archive-the-eu-domain/
Thesis http://sobre.arquivo.pt/sobre/publicacoes-1/Documentos-acerca-do-Arquivo.pt/information-search-in-web-archives
Slides from IIPC GA 2015 http://www.netpreserve.org/sites/default/files/attachments/2015_IIPC-GA_Slides_11_Gomes.pptx
a colleague's notes: https://www.evernote.com/shard/s43/sh/e6e12603-ecb2-42ae-8532-67d2779b4a86/3b2162e0bcc710d847b6fa5e86cc70b2
UK Web Archive full-text search prototype: Shine
Prototype: https://www.webarchive.org.uk/shine/search/advanced
Wiki: https://github.com/ukwa/shine/wiki/Specification
Code: https://github.com/ukwa/shine
Presentation by Helen Hockx-Yu: http://www.netpreserve.org/sites/default/files/attachments/2015_IIPC-GA_Slides_08_Hockx.ppt
Video: https://www.youtube.com/watch?v=o4iIdZP4rg8
Further examples
Website Classification Dataset
http://data.webarchive.org.uk/opendata/ukwa.ds.1/classification/
HTTP Archive
In addition to the content of web pages, it's important to record how this digitized content is constructed and served. The HTTP Archive provides this record. It is a permanent repository of web performance information such as size of pages, failed requests, and technologies utilized. This performance information allows us to see trends in how the Web is built and provides a common data set from which to conduct web performance research.
http://httparchive.org/trends.php?s=All&minlabel=Nov+15+2010&maxlabel=Sep+15+2015
http://httparchive.org/interesting.php
Lectures from Stanford on current thinking about web archives
IIPC GA 2015
https://www.youtube.com/channel/UCkUsw2Lo1ahekgy_xEb11BA/videos