Web archives are created through regular crawls performed by Archive-It, a service provided by the Internet Archive.
Archive-It performs regular crawls to create as complete a snapshot of the University's website as is feasible. In a given web crawl, Archive-It begins from a seed (a starting URL, such as www.albany.edu) or set of seeds, and then automatically harvests successive layers of the website by following links from that seed. This process continues for a specified duration or until a specified number of documents has been harvested.
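The seed-and-follow process described above can be sketched as a simple breadth-first crawl with a document limit. This is a minimal illustration only, not Archive-It's actual crawler; the link graph below is a hypothetical stand-in for a real website:

```python
from collections import deque

def crawl(seed, links, document_limit):
    """Breadth-first harvest: start at the seed, follow links layer by
    layer, and stop once the document limit is reached (cf. the
    1,000-document cap on the daily crawls described below)."""
    harvested = []
    queue = deque([seed])
    seen = {seed}
    while queue and len(harvested) < document_limit:
        url = queue.popleft()
        harvested.append(url)            # "download" the document
        for out in links.get(url, []):   # discover outlinks
            if out not in seen:
                seen.add(out)
                queue.append(out)
    return harvested

# Hypothetical link graph standing in for a real website:
site = {
    "www.albany.edu": ["www.albany.edu/a", "www.albany.edu/b"],
    "www.albany.edu/a": ["www.albany.edu/c"],
}
print(crawl("www.albany.edu", site, document_limit=3))
# → ['www.albany.edu', 'www.albany.edu/a', 'www.albany.edu/b']
```

Pages beyond the limit (here, www.albany.edu/c) remain discovered but unharvested, which is why a crawl report can show far more discovered than downloaded documents.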
Series 1, the main UAlbany website, is crawled daily, with more comprehensive crawls performed monthly. The daily crawls are limited to approximately 1,000 documents, whereas the monthly crawls harvest as much as possible in a five-day period.
Series 2, the UAlbany NewsCenter website, is crawled once per week. Each crawl lasts no longer than 3 days.
Additional information on Archive-It is available on the Archive-It website.
Additional information on the Internet Archive is available on the Internet Archive website.
crawl: 257751
Crawl Rules
Ignore Robots.txt for www.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for www.alumni.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for www.ualbanysports.com (last updated 2016-02-11)
Ignore Robots.txt for library.albany.edu (last updated 2017-05-19)
Ignore Robots.txt for alumni.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for asrc.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for atmos.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for bioinformatics.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for cela.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for choose.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for cs.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for csda.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for imls.ctg.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for www.ctg.albany.edu (last updated 2017-05-19)
Ignore Robots.txt for cwig.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for events.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for hr.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for ibl.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for illiad.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for liblogs.albany.edu (last updated 2017-05-19)
Ignore Robots.txt for libguides.library.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for scholarsarchive.library.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for listserv.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for m.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for math.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for omega.math.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for mumford.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for nyjm.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for pdp.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for resnet.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for rit.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for cyberphysics.rit.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for rna.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for slsc.albany.edu (last updated 2017-05-19)
Ignore Robots.txt for uaems.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for uapps.albany.edu (last updated 2016-02-11)
Ignore Robots.txt for wiki.albany.edu (last updated 2016-02-11)
Block host dev.library.albany.edu
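The "Ignore Robots.txt" rules above direct the crawler to harvest pages even where a host's robots.txt file would otherwise exclude them. A sketch of what honoring robots.txt looks like, using Python's standard-library parser (the Disallow rule here is hypothetical, not the actual contents of any albany.edu robots.txt):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the real files on albany.edu hosts differ.
rules = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A crawler honoring robots.txt would skip the disallowed path;
# an "Ignore Robots.txt" rule tells the crawler to harvest it anyway.
print(rp.can_fetch("*", "https://www.albany.edu/private/page.html"))  # → False
print(rp.can_fetch("*", "https://www.albany.edu/academics.html"))     # → True
```

Ignoring these exclusions makes the archived snapshot more complete, at the cost of overriding the site owner's crawl preferences; the "Block host" rule works in the opposite direction, excluding a host regardless of its robots.txt.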
Crawl Times
start_date: 2017-01-01T15:23:08Z
original_start_date: 2017-01-01T15:23:08Z
last_resumption: None
processing_end_date: 2017-01-01T15:43:37Z
end_date: 2017-01-01T15:30:18Z
elapsed_ms: 410689
Crawl Types
type: DAILY
recurrence_type: DAILY
pdfs_only: False
test: False
Crawl Limits
time_limit: 82800
document_limit: 1000
byte_limit: None
crawl_stop_requested: None
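The time_limit appears to be expressed in seconds (an assumption; the report does not state the unit). Under that reading, 82800 seconds is a 23-hour cap, which fits a crawl that recurs daily:

```python
time_limit = 82800        # from the Crawl Limits above, assumed to be seconds
hours = time_limit / 3600 # convert seconds to hours
print(hours)              # → 23.0, leaving an hour of headroom in a daily cycle
```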
Crawl Results
status: FINISHED_DOCUMENT_LIMIT
discovered_count: 5072
novel_count: 339
duplicate_count: 674
resumption_count: 0
queued_count: 4059
downloaded_count: 1013
download_failures: 0
warc_revisit_count: 644
warc_url_count: 1012
total_data_in_kbs: 135417
duplicate_bytes: 116700379
warc_compressed_bytes: 6144534
Crawl Technical Details
doc_rate: 2.47
kb_rate: 329.0
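The reported rates appear to be per-second throughput figures derived from the results above, i.e. doc_rate ≈ downloaded_count / elapsed seconds and kb_rate ≈ total_data_in_kbs / elapsed seconds (an assumption about the formulas, checked against the figures in this report):

```python
# Figures taken from the crawl report above.
elapsed_ms = 410689
downloaded_count = 1013
total_data_in_kbs = 135417

elapsed_s = elapsed_ms / 1000  # ~410.7 seconds of crawl time

doc_rate = downloaded_count / elapsed_s
kb_rate = total_data_in_kbs / elapsed_s

print(round(doc_rate, 2))  # → 2.47, matching the reported doc_rate
print(round(kb_rate, 1))   # → 329.7, close to the reported kb_rate of 329.0
```

The document rate matches exactly; the kilobyte rate lands within about 1 KB/s of the reported value, so Archive-It's exact divisor or rounding may differ slightly.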