Documenting Time-Sensitive Campus Events Through Social Media and Web Archives Policy

1. Introduction

The ephemeral and dynamic nature of websites and social media communication requires that UTARMS take a proactive role in evaluating sources for long-term preservation, as these formats are uniquely sensitive to loss. This kind of documentation raises significant concerns regarding privacy and ethics, as well as how the technical capture process shapes the records themselves.

2. Purpose

The purpose of this policy is to outline criteria and documentation for assessing events which may require web archiving and/or the capture of social media. It also establishes departmental guidelines regarding the type and treatment of information captured within.

3. Policy

  1. Events are captured through social media or web archiving when these sources are understood to provide unique documentation that significantly shapes the understanding of the event, or the event itself.
  2. Events should have impact on the breadth of UofT and its communities and represent areas of substantial research interest.
  3. Because of staff resources and technological limitations, at this time, UTARMS will only collect webpages and tweets from user timelines and hashtags. This may be revisited as platforms and web archiving technologies change.
  4. Where possible, UTARMS will consult with content creators beforehand about our intention to document the event and discuss any concerns they may have about its documentation.
  5. UTARMS prioritizes individuals’ right to privacy including the “right to be forgotten”. If a tweet has been deleted, made private or an account suspended, the content will no longer be available when the dataset is accessed at a later date.
  6. With regards to websites, when a site owner authorizes communication of their work to the public over the internet without technological restrictions, we view this as their implicit consent to the indexing and caching of their site. Where a site uses technological protection measures to restrict crawling technology (such as robots.txt), we will not harvest the content without providing notification and securing permission.
    1. Collection finding aids will include an explicit sentence inviting copyright owners to contact UTARMS if they do not wish to have their work or website(s) made available through the collection: You are invited to contact us if you believe that any of the material shared in this collection represents an infringement of copyright, privacy or other legal right. Please include a link to the material in question, along with information about the nature and reason for your request.
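The robots.txt check described in item 6 can be sketched with Python's standard library. This is a minimal illustration under stated assumptions, not the department's actual workflow: the function name `crawl_permitted`, the user agent string `utarms-archiver`, and the sample robots.txt content are all hypothetical.

```python
from urllib.robotparser import RobotFileParser


def crawl_permitted(robots_txt: str, page_url: str,
                    user_agent: str = "utarms-archiver") -> bool:
    """Return True if the robots.txt text allows user_agent to fetch page_url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, page_url)


# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in ("https://example.org/news/",
                "https://example.org/private/minutes.html"):
        if crawl_permitted(EXAMPLE_ROBOTS, url):
            print(url, "-> no crawl restriction found")
        else:
            print(url, "-> restricted: notify the owner and secure permission")
```

A restricted result here would only flag the site for the notification and permission step; it does not decide whether the content is collected.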

4. Definitions

Content creators: Individuals or groups who oversee the creation of content on Twitter.

Hashtags: A word or phrase preceded by a hash sign (#), used on social media websites and applications, especially Twitter, to identify digital content on a specific topic. Tweets are collected by hashtag using the recent search API endpoint, which returns tweets matching a search query (e.g. a hashtag) from the last 7 days. Access to the full-archive search endpoint (dating back to March 2006) is available by applying to the Academic Research product track.
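The 7-day limit determines whether a desired collection start date can be reached with the standard search endpoint or needs full-archive access. The following is a minimal sketch of that decision. The function name `endpoint_for_start_date` and the labels it returns are hypothetical, and the only figure it relies on is the 7-day window stated above.

```python
from datetime import datetime, timedelta, timezone

SEARCH_WINDOW = timedelta(days=7)  # window stated in the definition above


def endpoint_for_start_date(start: datetime, now: datetime) -> str:
    """Label the kind of search access needed to reach back to `start`."""
    if start > now:
        raise ValueError("start date is in the future")
    if now - start <= SEARCH_WINDOW:
        return "7-day search"
    return "full-archive search"


if __name__ == "__main__":
    now = datetime(2021, 9, 30, tzinfo=timezone.utc)
    for start in (datetime(2021, 9, 27, tzinfo=timezone.utc),
                  datetime(2021, 9, 1, tzinfo=timezone.utc)):
        print(start.date(), "->", endpoint_for_start_date(start, now))
```

A "full-archive search" result corresponds to the case where an Academic Research product track application would be needed before collecting.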

Hydrate: A command that reads a file of tweet IDs and writes out the corresponding tweets as JSON using Twitter's statuses/lookup API. For further details, see the tool Hydrator.
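To illustrate the first step of hydration, the sketch below reads tweet IDs and groups them into batches before any lookup request is made. It is not the implementation used by Hydrator or Twarc. The helper names are hypothetical, and the batch size of 100 reflects the per-request ID limit of the v1.1 lookup endpoint as an assumption that should be checked against current platform documentation. No network call is made.

```python
from typing import Iterable, Iterator, List

LOOKUP_BATCH_SIZE = 100  # assumed per-request ID limit for the lookup endpoint


def read_tweet_ids(lines: Iterable[str]) -> List[str]:
    """Return stripped, non-empty tweet IDs in order, skipping duplicates."""
    seen = set()
    ids = []
    for line in lines:
        tweet_id = line.strip()
        if tweet_id and tweet_id not in seen:
            seen.add(tweet_id)
            ids.append(tweet_id)
    return ids


def batches(ids: List[str], size: int = LOOKUP_BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


if __name__ == "__main__":
    # Placeholder IDs for illustration; real input would be a file of tweet IDs.
    sample_lines = [f"{n}\n" for n in range(1, 251)]
    ids = read_tweet_ids(sample_lines)
    print([len(batch) for batch in batches(ids)])
```

Each batch would then be sent to the lookup API. Tweets that have since been deleted or made private are simply absent from the response, which is how the hydrated dataset respects later removals.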

Third-Party Access: In relation to Twitter’s Platform Usage Guidelines, UTARMS defines ‘Third-Party’ as any individual who is not a UTARMS staff member.

Tweet ID: A unique identifier associated with an individual tweet that is generated by Twitter’s API.

User timeline: Tweets published by a specific Twitter account. Tweets from a user timeline are collected using the user Tweet timeline API endpoint, which can return the 3,200 most recent Tweets, Retweets, replies and Quote Tweets posted by the user (no time limit).

Web archives: A collection of archived websites grouped by theme, event, or subject area. Web archiving is the process of creating an archival copy of a website. An archived site is a snapshot of how the original site looked at a particular point in time. The archive contains as much as possible from the original site, including text, images, audio, videos, and PDFs.

5. Procedures

  1. Events are evaluated using the Collection Decision Matrix (see Appendix 1). This is intended to identify various considerations and weigh these against each other in order to document and inform decision-making. This document is brought forward to and approved by the University Archivist.
    1. When an archivist identifies an event that may require proactive documentation by UTARMS, the archivist should complete a RASCI Responsibility Matrix.
    2. Once complete, the archivist will call a meeting as soon as possible with identified internal stakeholders to discuss and complete the Collection Decision Matrix.
    3. A copy of the completed matrix will be filed either in its Case File, or in the Departmental Shared Drive under ‘Declined Event Collections’.

    If a decision has been made to proceed with the collection:

  2. Complete the Collection Questionnaire (see Appendix 2).
  3. For the collection of tweets from a specific Twitter user’s timeline, content creators are contacted by email to obtain consent prior to the start of collecting.
  4. Collect tweets using the Twitter API (using either Twarc or Social Feed Manager) in order to abide by Twitter’s Platform Usage Guidelines (see section Public Display of Tweets).
  5. Archive websites using the desktop tool. As of 2021, this provides the required flexibility to control and remove content, which is necessary to ensure our ability to address privacy concerns over time. Web archiving tools and functionality will be reviewed and updated regularly.
  6. At the completion of the event, accession the collection.
    1. Unless the content is explicitly administrative, the collection will be accessioned as a ‘B’.
    2. A copy of the completed matrix alongside any correspondence from content creators is to be included in the Case File.
  7. Only provide access to Tweet IDs, abiding by Twitter’s Platform Usage Guidelines (see section Content Redistribution).
    1. Once the event has been deemed to have ended or been resolved, a finding aid will be created in Discover Archives describing the collection, its origins and the nature of the information collected.
    2. Tweet IDs will be stored on Dataverse with a link provided in the Finding Aid.
    3. Instructions will be provided to users on how to use the dataset and “re-hydrate” the Tweet IDs using Hydrator and replay the WARC files using
  8. This policy will be reviewed every year in order to take into account any changes in technology or in social media platforms’ Terms of Service or Usage Guidelines.
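Step 7 above limits public access to Tweet IDs. A minimal sketch of deriving an ID list from collected tweets follows. It assumes the collection is stored as line-oriented JSON with one v1.1-style tweet object per line, each carrying its ID in an `id_str` field. The function name and the sample records are hypothetical. Twarc provides its own `dehydrate` command for this task, and the sketch only illustrates the idea.

```python
import json
from typing import Iterable, Iterator


def tweet_ids_from_jsonl(lines: Iterable[str]) -> Iterator[str]:
    """Yield the ID of each tweet in line-oriented JSON (one tweet per line)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        tweet = json.loads(line)
        # Assumption: v1.1-style tweet objects expose the ID as the string "id_str".
        yield tweet["id_str"]


if __name__ == "__main__":
    # Made-up records for illustration only.
    sample = [
        '{"id_str": "1", "text": "hello"}',
        "",
        '{"id_str": "2", "text": "world"}',
    ]
    for tweet_id in tweet_ids_from_jsonl(sample):
        print(tweet_id)
```

Only the resulting ID list would be deposited for public access; the full tweet JSON stays out of the shared dataset.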

See also

Zefi Kavvadia. (2021, February 2). An Overview of Social Media Archiving Tools (Version 1.0). Zenodo.

Social media research ethical and privacy guidelines, GW University Libraries (Feb. 2018)

Appendix 1 - Collection Decision Matrix

  1. Name and brief description of event
  2. What aspects of the event would be captured through social media or web-based documentation? Are these duplicated elsewhere? Do these represent a unique view or perspectives on the event?
  3. What is the scope of the event? Which communities at UofT does this affect?
  4. Are the communities affected those that have been historically underrepresented in our collections and invisibilized at the institution at large?
  5. Does it affect the UofT community broadly, for example reputationally?
  6. Are there features of the event and issues surrounding it that are linked to significant areas of research interest?
  7. Is this a contained event or will it be on-going?
  8. What risks could the capturing and/or preservation of this content present to those involved or impacted?




Appendix 2 - Collection Questionnaire

Once it has been decided to start documenting a campus event, fill out the following: 

  1. What are the primary hashtags?
  2. Are there any accounts that we think would be useful to capture as well?
  3. After initial harvest, how frequently should we capture new content? (every day? every week?)
  4. Are there tweets from before last week that we would like to include in the collection?
  5. What date should tweet collecting commence? [Do we need to apply for the Academic Research Track?]
  6. Do we want to include only Twitter content or do we want to include other content, e.g. websites?

Last updated: 2021-09-30