LTE: Scraping web-sites to collect data

In this post, I detail how I collected the data for the letters to the editor corpus analysis project.

First, I picked several web-sites where I could access letters to the editor archives:

- Chicago Tribune
- Daily Herald
- The Citizen
- Times Union
- Dubois County Press
- Enquirer Democrat
- Ellsworth American
- Mount Desert Islander

While I was able to collect data from all of these web-sites, it later turned out that the Ellsworth American and Mount Desert Islander limit access to three articles per month, so I had to discard their data. In the end, the remaining sources were a nice mix of urban (Chicago Tribune, Daily Herald), suburban (The Citizen, Times Union), and seemingly rural (Dubois County Press, Enquirer Democrat) media.

Each of the newspapers above has a page dedicated to letters to the editor. The structure of most of these pages is very similar and resembles a blog. The title page, usually at a URL like site/letters-to-the-editor/, contains about 10 or 20 post snippets with links to the full letters, in reverse chronological order. It also links to further title pages with links to older letters, usually at URLs like site/letters-to-the-editor/page#, each again holding about 10 or 20 snippets. Scraping these letters is therefore a relatively easy task: find all the links to the letters on a title page, follow them, save the corresponding pages, then move on to the next title page and repeat.
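To make the pattern concrete, here is a minimal sketch of that loop in Python using the requests and BeautifulSoup libraries (not the tools I actually used, which are described below). The base URL, the page# pagination scheme, and the link filter are placeholder assumptions for a hypothetical site.

```python
# Minimal sketch of the crawl pattern, assuming a hypothetical site whose
# title pages live at BASE, BASE/page2, BASE/page3, ... and whose letter
# links share the same letters-to-the-editor path prefix.
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE = "https://example-newspaper.com/letters-to-the-editor"  # hypothetical site


def crawl_letters(max_pages=10, delay=5):
    for n in range(1, max_pages + 1):
        index_url = BASE if n == 1 else f"{BASE}/page{n}"
        resp = requests.get(index_url)
        if resp.status_code != 200:
            break  # ran out of title pages
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.select("a[href*='/letters-to-the-editor/']"):
            letter_url = urljoin(index_url, a["href"])
            if "/page" in letter_url or letter_url.rstrip("/") == BASE:
                continue  # skip pagination links and the title page itself
            letter = requests.get(letter_url)
            name = letter_url.rstrip("/").split("/")[-1]
            with open(name if name.endswith(".html") else name + ".html",
                      "w", encoding="utf-8") as f:
                f.write(letter.text)
            time.sleep(delay)  # pause between requests (see the note on throttling below)
        time.sleep(delay)
```

In practice each site needs its own link filter, which is where the scrapy setup described below comes in.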

When I was just starting the project, I wanted to get the data as quickly as possible, so I tried a Firefox extension, ScrapBook, to automatically download data from two of the web-sites, Chicago Tribune and Times Union. It worked relatively well. However, the only available filter option was to specify the link level, where first-level links pointed to the newest posts and second-level links to the older ones. It was also possible to restrict downloads to that site's domain, which eliminated advertisements and links to other web-sites. Even with the domain restricted, though, the pages still contained all sorts of links that led to unnecessary downloads, so once I had the content I had to dig through the files to filter out the actual letters.

For the other sites I wrote a simple scraper using the scrapy package. I collected a URL list of the title pages, such as [site/letters-to-the-editor, site/letters-to-the-editor/page2, site/letters-to-the-editor/page3, etc.]. I then specified the format of the letter links so that the scraper knew which pages to follow and save. The letter URLs were usually something like site/letters-to-the-editor/date/title.html, so it was again easy to specify which links to follow. The filtering was not 100% clean: I would sometimes get pages from the same site that were not letters to the editor but regular articles. However, the number of these oddball articles was much lower than with ScrapBook.
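For reference, here is roughly what such a spider looks like. This is a hedged sketch, not my exact code: the domain, the page range, the letter-URL regex, and the CSS selectors are all placeholder assumptions that had to be adjusted for each site.

```python
# Sketch of a scrapy CrawlSpider for one hypothetical site. The domain,
# URL patterns, and selectors below are illustrative assumptions.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class LettersSpider(CrawlSpider):
    name = "letters"
    allowed_domains = ["example-newspaper.com"]

    # Title pages: site/letters-to-the-editor/, site/letters-to-the-editor/page2, ...
    start_urls = ["https://example-newspaper.com/letters-to-the-editor/"] + [
        f"https://example-newspaper.com/letters-to-the-editor/page{n}"
        for n in range(2, 21)
    ]

    # Follow only links that look like individual letters,
    # e.g. site/letters-to-the-editor/2017/09/some-title.html
    rules = (
        Rule(
            LinkExtractor(allow=r"/letters-to-the-editor/\d{4}/.+\.html$"),
            callback="parse_letter",
        ),
    )

    def parse_letter(self, response):
        # The selectors here are guesses at a typical WordPress-style layout.
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "body": " ".join(response.css("div.entry-content p::text").getall()),
        }
```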

Using my own crawler instead of ScrapBook taught me a lesson about throttling the crawling speed. ScrapBook inserts an automatic sleep period between requests; when I wrote my own scraper, I learned the hard way that I needed to specify a long enough delay between requests myself. While scraping the Dubois County Press web-site, I initially hadn't set any delay at all and managed to bring the site down for a period of time, for which I apologize profusely.
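For anyone using scrapy, the built-in settings make this easy. Something like the following keeps the crawler polite; the specific values here are illustrative rather than the ones I ended up with.

```python
# settings.py (or custom_settings on the spider) -- illustrative values only
DOWNLOAD_DELAY = 5                  # wait ~5 seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True     # jitter the delay between 0.5x and 1.5x
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # one request at a time per site
AUTOTHROTTLE_ENABLED = True         # back off automatically if responses slow down
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
ROBOTSTXT_OBEY = True               # respect the site's robots.txt
```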

I needed to specify the URL format for each site separately, as they differed slightly (although a couple of them shared the same formatting, which was convenient). Had I spent a little more time on this, I could probably have included a module to determine the URL structure automatically. However, I had only a handful of sites, and I wanted to get on with the more interesting parts of the project, such as author and topic identification. But before I could get to that, I had to label all of the data. Read about the labeling process in the next post.
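To give a concrete picture of what "specifying the URL format for each site" means, here is the kind of per-site configuration involved. The paper labels are real, but these particular regexes are hypothetical placeholders rather than the exact patterns I used.

```python
# Hypothetical per-site letter-URL patterns; each regex would be plugged into
# the spider's LinkExtractor for the corresponding site.
LETTER_URL_PATTERNS = {
    "chicago-tribune":   r"/opinion/letters/.+\.html$",
    "times-union":       r"/opinion/letters/.+\.php$",
    "enquirer-democrat": r"/letters-to-the-editor/\d{4}/.+\.html$",
}
```

An auto-detection module would essentially learn patterns like these from a few example letter URLs per site.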
