In March 2006, UTARMS began a pilot project to archive 12 selected University of Toronto websites through a subscription to Archive-It, a service offered by the Internet Archive. Over the years, this collecting has grown and evolved, and approximately 75 websites are now crawled at least once a year. The value of this work is demonstrated by the 140 website URLs (and counting) currently available for research online. This policy is an attempt to better coordinate the ad hoc collecting of the last 15 years and to standardize the selection and frequency of web captures.
The purpose of this policy is to outline the scope and selection criteria of websites to be archived by UTARMS.
- The transition to digital information for most University business means a substantial amount of relevant information now exists as online resources. Some of this information constitutes University Records as defined in UTARMS Records Management Policy and Procedure, some documents research conducted by faculty, and some documents aspects of student and campus life. UTARMS’ web archiving strategy strives to capture a sample that represents the breadth of online resources and engagements created by the broad U of T Community.
- UTARMS’ web archiving strategy relies on University administrative units to continue to follow University Records Management Policy and Procedure, and on Units to remain responsible for the maintenance, retention and disposition of all University Records created and supplied online. Units should refer to the U of T File Plan for retention and disposition of website infrastructure that constitutes University Records.
- UTARMS' web archiving strategy also collects website content produced by faculty, associations and organizations that are affiliated with U of T but are not among the institution's administrative units. These websites are identified and evaluated in line with private records appraisal criteria, the Private Records Documentation Strategy, and the considerations outlined in Section 4.
- Presently, UTARMS collects websites that fall within the following areas:
- University Faculties
- Academic Calendars
- University Governance
- University News and Student Publications
- Student unions, Campus Labour Unions and the Faculty Association
- On a case-by-case basis, UTARMS will also crawl (one-time only) a website outside of these categories if:
- it is about to undergo a major redesign or is being discontinued/closed;
- it is related to a private records acquisition, such as a professor’s homepage.
- Major events affecting the campus (e.g. the COVID-19 pandemic) may warrant additional web archive collecting. These will be addressed on a case-by-case basis; for further information, see Documenting Time-Sensitive Campus Events Through Social Media and Web Archives Policy.
4. Selection, Acquisition, and Frequency
- University websites are selected for collecting by the Digital Records Archivist in consultation with the Records Archivist and the University Archivist.
- Websites created by individuals and organizations distinct from the University and its administrative functions, for example professors, student organizations, or labour groups, are identified and assessed by the Digital Records Archivist and the archivist working with a particular donor or group, or the Private Records Archivist.
- The following considerations are evaluated when deciding whether and how often to crawl a website:
- The creator and purpose of the online resource, including the relevance or significance of the content to UTARMS' mandate and the Private Records Documentation Strategy.
- The ongoing availability of the information, including consideration of the following:
- copyright and privacy,
- technological capabilities and limitations to capture and preserve,
- the stability or temporal nature of the content,
- whether the site is the best record of the content.
- U of T staff, students, and community members are also encouraged to submit websites for possible crawling by completing the website nomination form. Submissions are assessed based on the criteria above, distinguishing between A and B acquisitions.
- If the website is not part of the utoronto.ca domain, website owners will be contacted to obtain permission to archive their website prior to the website being crawled.
- Websites will be captured through the Archive-It service as part of the broader University of Toronto Libraries subscription.
- In some instances, the ArchiveWeb.page desktop tool will be used instead, with WARC files subsequently uploaded to Archive-It for access.
- Because of technological scoping challenges and size, embedded streaming media players with video or audio content are considered out of scope for our web archiving strategy and will be appraised, scheduled, and acquired separately.
- Unlike other types of records acquired by UTARMS, web archives are not accessioned because of the frequency of their acquisition throughout the year. Captured websites are tracked in the A and B seed lists spreadsheet.
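The permission rule in this section (owners of websites outside the utoronto.ca domain are contacted before crawling) can be sketched as a simple hostname check. This is an illustrative sketch only, not part of UTARMS' actual workflow; the function name `needs_owner_permission` and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit

# Assumption: "part of the utoronto.ca domain" means the hostname is
# utoronto.ca itself or any subdomain of it.
INSTITUTIONAL_DOMAIN = "utoronto.ca"


def needs_owner_permission(url: str, domain: str = INSTITUTIONAL_DOMAIN) -> bool:
    """Return True when the URL's host falls outside the institutional domain."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    inside = host == domain or host.endswith("." + domain)
    return not inside


# Hypothetical seed URLs, for illustration only.
for seed in ("https://www.utoronto.ca/news", "https://example.org/"):
    print(seed, "-> contact owner first:", needs_owner_permission(seed))
```

Matching on the full hostname suffix (with the leading dot) avoids treating look-alike hosts such as `notutoronto.ca` as part of the University domain.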
Archived websites may be viewed and searched through UTARMS' Archive-It page. In addition, websites are described in DiscoverArchives and associated with existing fonds/collections at the accession-level for A’s and the series-level for B’s.
b. Descriptive Metadata
Basic Dublin Core metadata (Title, Description, Subject) is added at the collection and seed levels.
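As a rough illustration of what a record with these three elements might look like, the sketch below builds a small XML fragment using the standard Dublin Core element namespace. The wrapper element `record`, the function name, and all field values are hypothetical placeholders; this does not represent the format Archive-It itself uses to store or export metadata.

```python
import xml.etree.ElementTree as ET

# Standard namespace for the Dublin Core Metadata Element Set 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def dublin_core_record(title, description, subjects):
    """Build a minimal record with dc:title, dc:description and dc:subject."""
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")  # hypothetical wrapper element
    ET.SubElement(record, f"{{{DC_NS}}}title").text = title
    ET.SubElement(record, f"{{{DC_NS}}}description").text = description
    for subject in subjects:
        ET.SubElement(record, f"{{{DC_NS}}}subject").text = subject
    return record


# Placeholder values, for illustration only.
seed_record = dublin_core_record(
    title="Example student publication",
    description="Placeholder description of the archived website.",
    subjects=["Student publications", "Campus life"],
)
print(ET.tostring(seed_record, encoding="unicode"))
```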
c. “Look and Feel”
Every effort will be made to preserve the “look and feel” of the website as of the date of the crawl, but in some instances the web crawler may not be able to preserve the exact form, functionality, and content of sites as they appear on the web. The following types of content present significant issues for capture and/or display:
- Streaming media players with video or audio content
- Password-protected material
- Forms or database-driven content that requires interaction with the site
- Exclusions specified in robots.txt files
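To illustrate the last item, the sketch below shows how a crawler that honours robots.txt would interpret an exclusion rule, using Python's standard `urllib.robotparser` module. The robots.txt content, the user-agent string `example-crawler`, and the URLs are hypothetical; the actual behaviour of any given crawl depends on how that crawl is configured.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content excluding one directory for all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse in memory, no network request
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs, for illustration only.
for page in ("https://example.org/news/", "https://example.org/private/report.html"):
    print(page, "-> may be captured:", allowed_by_robots(ROBOTS_TXT, "example-crawler", page))
```

Pages under an excluded path would be missing from a capture made by a crawler that follows these rules, which is why such exclusions affect how completely a site can be preserved.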
a. Ownership of Content
UTARMS asserts no claim of copyright or any other intellectual property right in the content included in the web archive collection. These materials are intended for use for educational purposes only. Copyright over archived web material remains with the owner(s) of the web content.
b. Archiving Third-Party Content
When a content owner posts a work openly on the internet, they are authorizing communication of their work to the public by telecommunication, including by automated systems. We view this as an implied license to index and cache their site. In making this content available, we are furthering one of the goals of copyright, which is to promote the wider public dissemination of works. In cases where websites are posted to the web in a way that signals a prohibition on caching by automated systems, UTARMS may still include such sites in its web archives via a case-by-case fair dealing determination, provided that the collecting activity considers factors such as the nature of the collected materials themselves and the risks that the preservation of this content may pose to those involved.
c. Authorized Use of Third-Party Content
d. Permission to Use Third-Party Content
All requests for permission to reproduce and use the archived content must be sent to the copyright holder(s) directly. We will not act as intermediaries to any such transaction. Users are responsible for identifying the copyright status of the archived content, as well as identifying and contacting the appropriate authority for permission. All rights in the content are presumed to remain with the owner(s) identified on the website.
7. Take-Down and Opting Out
Website owners are invited to contact us to opt out of the archiving of their site.
UTARMS is committed to maintaining an accurate and authentic historical record of websites relevant to our collections policies. UTARMS also recognizes there may be legal or ethical reasons to cease archival collection or provision of a website in this program. Website owners and individuals are invited to contact us with questions or concerns about content in our web archives, or to request a site be removed from our collection or access platforms.
Captured files are stored in the WARC archival file format. Files are stored by the Internet Archive as part of our subscription to the Archive-It service.
- Queen's University
- University of Ottawa
- University of Victoria
- University of Alberta
- University of California San Francisco
- Michigan State University
Last updated: 2021-12-03