LTE: Scraping web-sites to collect data

In this post, I detail how I collected the data for the letters to the editor corpus analysis project.

First, I picked several web-sites where I could access letters to the editor archives:

- Chicago Tribune
- Daily Herald
- The Citizen
- Times Union
- Dubois County Press
- Enquirer Democrat
- Ellsworth American
- Mount Desert Islander

While I was able to collect data from all of these web-sites, it later turned out that Ellsworth American and Mount Desert Islander limit access to 3 articles per month, so I had to discard their data. In the end it was a nice mix of urban (Chicago Tribune, Daily Herald), suburban (The Citizen, Times Union) and what seems like rural (Dubois County Press, Enquirer Democrat) media.

Each of the newspapers above has a page dedicated to letters to the editor. The structure of most of these pages is very similar and resembles a blog. The title page, usually with a URL like site/letters-to-the-editor/, contains about 10 or 20 post snippets with links to the full letters in reverse chronological order. It also links to further title pages that list older letters, usually with URLs like site/letters-to-the-editor/page#, again with about 10 or 20 snippets each. Thus, scraping these letters would be a relatively easy task: find all the links to the letters, follow them, and save the corresponding pages; then go to the next page and repeat.
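To make the process concrete, here is a minimal sketch of that loop, written with the requests and BeautifulSoup packages rather than the tools I describe below; the base URL and the CSS selector for the snippet links are placeholders, since every site marks its pages up a little differently.

    import time
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    BASE = "https://example-paper.com/letters-to-the-editor/"

    page = 1
    while True:
        # Title pages: the front page, then page2, page3, ...
        listing_url = BASE if page == 1 else f"{BASE}page{page}"
        listing = requests.get(listing_url)
        if listing.status_code != 200:
            break
        soup = BeautifulSoup(listing.text, "html.parser")
        # Each snippet on the title page links to the full letter.
        links = [urljoin(BASE, a["href"]) for a in soup.select("h2.entry-title a")]
        if not links:
            break
        for link in links:
            letter = requests.get(link)
            filename = link.rstrip("/").split("/")[-1]
            with open(filename, "w", encoding="utf-8") as f:
                f.write(letter.text)
            time.sleep(2)  # pause between requests; more on throttling below
        page += 1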

When I was just starting the project, I wanted to get the data as quickly as possible, and so I tried a Firefox extension, ScrapBook, to automatically download data for two of the web-sites, Chicago Tribune and Times Union. It worked relatively well. However, the only available filter option was to specify the link level, where the first-level links pointed to the newest posts and the second-level links to the older ones. It was also possible to restrict the domain to that site only, which eliminated advertisements and links to other web-sites. However, even with the domain restricted, the pages still contained all sorts of links that led to unnecessary downloads. Once I received the content, I had to dig through the files to pick out the actual letters.

For the other sites I wrote a simple scraper using the scrapy package. I collected a list of title-page URLs, such as [site/letters-to-the-editor, site/letters-to-the-editor/page2, site/letters-to-the-editor/page3, etc.]. I then specified the format of the letter links so that the scraper could follow and save them. Usually the letter URLs looked like site/letters-to-the-editor/date/title.html, so it was again easy to specify which links to follow. It was not 100% clean: sometimes I would get files from the same site that were not letters to the editor but regular articles; however, the number of these oddball articles was much lower than with ScrapBook.
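Roughly, such a spider can look like the sketch below; the domain, the list of title pages, and the regular expression for the letter links are placeholders here, since in practice each site needed its own pattern.

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class LettersSpider(CrawlSpider):
        name = "letters"
        allowed_domains = ["example-paper.com"]
        # Title-page URLs collected by hand: the front page plus its numbered pages.
        start_urls = ["https://example-paper.com/letters-to-the-editor/"] + [
            f"https://example-paper.com/letters-to-the-editor/page{i}"
            for i in range(2, 20)
        ]

        rules = (
            # Follow only links that look like individual letters,
            # e.g. site/letters-to-the-editor/<date>/<title>.html
            Rule(
                LinkExtractor(allow=r"/letters-to-the-editor/\d{4}/.+\.html$"),
                callback="parse_letter",
            ),
        )

        def parse_letter(self, response):
            # Save the raw page; pulling the letter text out of the HTML happens later.
            filename = response.url.rstrip("/").split("/")[-1]
            with open(filename, "wb") as f:
                f.write(response.body)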

Using my own crawler instead of ScrapBook taught me a lesson about throttling the crawling speed. ScrapBook has an automatic sleep period between requests. When I wrote my own scraper, I learned the hard way that I needed to specify a long enough sleep time between requests. When I was scraping the Dubois County Press web-site, I managed to bring it down for a period of time, since initially I hadn’t set any sleep time at all, for which I apologize profusely.
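If you use scrapy, this kind of politeness lives in the project settings (settings.py); the values below are illustrative, not the exact ones I ended up using.

    # Throttling-related scrapy settings
    DOWNLOAD_DELAY = 5                   # wait a few seconds between requests
    CONCURRENT_REQUESTS_PER_DOMAIN = 1   # never hit a site with parallel requests
    AUTOTHROTTLE_ENABLED = True          # back off automatically if the site slows down
    ROBOTSTXT_OBEY = True                # respect the site's crawling rules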

I needed to specify the URL format for each site separately, as they differed slightly (although a couple of them shared the same format, which was convenient). If I had spent a little more time on this, I could probably have included a module to automatically determine the URL structure. However, I had only a handful of sites, and I wanted to get on with the more interesting parts of the project, such as author and topic identification. But before I could get to those, I had to label all of the data. Read about the labeling process in the next post.

2 thoughts on “LTE: Scraping web-sites to collect data”

  1. There are specialized tools for downloading websites; they can extract just the plain text.
    Have you looked into this?

    As for data itself:
    it’s interesting. I think some people have the impression that their gender is mistreated and often write using pseudonyms or made-up names coupled with an intentional gender change. That might affect the data you collect.
    Also, age difference, location, and other cultural factors can have more influence than gender. And people often copy snippets from other authors.
    All this adds a lot of noise to the data.

    As for the site and comments, try something like disqus.com

    1. It is easy to just strip HTML tags from a web-page after downloading, for example using BeautifulSoup:

      from bs4 import BeautifulSoup
      text = "Some html"
      soup = BeautifulSoup(text, "html.parser")  # parse the markup
      plain = soup.get_text()                    # plain text with the tags stripped

      However, in my case, that was not the best solution. I didn’t need all of the text, since the letter web-pages include advertisements, links to other content, comments, etc. When I was parsing the text out, I analyzed the different divs on the page so that only the text of the letter was included. I also kept the formatting from HTML, which I later used as features in machine learning.
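      As a rough illustration of that kind of filtering (the div class name below is a placeholder; every site marks up the letter body differently):

      from bs4 import BeautifulSoup
      with open("letter.html", encoding="utf-8") as f:
          soup = BeautifulSoup(f.read(), "html.parser")
      # Keep only the container that holds the letter itself.
      body = soup.find("div", class_="article-body")  # placeholder class name
      if body is not None:
          letter_html = str(body)        # keep the HTML formatting for features
          letter_text = body.get_text()  # plain text for the corpus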
      As for gender, you make several very valid statements. I think in letters to the editor the noise is minimized. I read all of the letters, and those wishing to keep their anonymity usually used initials only, or did not sign their letters.
      As for intentionally changing their name to the opposite gender, there is an interesting story that I heard. When I worked at a big company in Manhattan, my colleague (a woman) told me a story. She was a graduate student at some college in New York, and she was trying to get some question answered and emailed her professor. When she emailed as herself, she got brushed off. However, when she sent the exact same question under a male name, he answered it, and told her that it was a great question.
      So it is entirely possible that some of the people who wrote those letters to the editor changed their names to the opposite gender. However, I think it’s important to consider the goal of letters to the editor. The goal is to have one’s opinion known to the community, so I think the number of such people is minimal. Mostly, people want to have their opinion known under their real name, or sometimes, anonymously. As a wild thought, someone might intentionally sign as someone else they know, but my estimate is that the number of such letters is very low.
      Location has to be important, too, and I can analyze that data; it would be really interesting to know what the differences and interactions between gender and location are.
      When there were long quotes in the letters, I discarded those, precisely because the text no longer represented the author.
      Thanks for the thoughtful comments!
