Comments on: LTE: Scraping web-sites to collect data
https://www.practicallinguist.com/lte-scraping-web-sites-collect-data/
Linguistics meets computers
Tue, 05 Dec 2017 03:12:25 +0000

By: Zhenya
https://www.practicallinguist.com/lte-scraping-web-sites-collect-data/#comment-2
Sat, 30 Sep 2017 05:25:23 +0000

It is easy to just strip html tags from a web-page after downloading, for example using BeautifulSoup:

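from bs4 import BeautifulSoup

text = "<p>Some html</p>"
# Name the parser explicitly; omitting it triggers a warning.
soup = BeautifulSoup(text, "html.parser")
plain_text = soup.get_text()  # the text with all tags stripped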

However, in my case, that was not the best solution. I didn’t need all of the text, since the letter web-pages include advertisements, links to other content, comments, etc. When I was parsing the text out, I used some analysis of the different divs on the page to only include the text of the letter. Also, I kept the formatting from HTML, which I used for features in machine learning.
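To illustrate the div-based filtering described above, here is a minimal sketch, not the code I actually ran. It uses the standard library's html.parser instead of BeautifulSoup so it runs without extra packages, and the class name "letter-body" and the sample page are hypothetical; real newspaper sites will use their own markup.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect text found inside <div> elements carrying a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0  # 0 = outside the target div; >0 = nesting depth inside it
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1  # a nested div inside the target div
        elif self.target_class in classes:
            self.depth = 1  # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that appears inside the target div.
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_letter_text(html, target_class="letter-body"):
    """Return the text of the div with target_class (a hypothetical name)."""
    parser = DivTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: an ad, the letter itself, and a comment section.
page = """
<div class="ad">Buy now!</div>
<div class="letter-body"><p>Dear editor,</p><p>I disagree.</p></div>
<div class="comments">First!</div>
"""
print(extract_letter_text(page))  # prints: Dear editor, I disagree.
```

The same handler could also record which tags (bold, italics, paragraph breaks) surround each chunk if the formatting is needed later as features.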
As for gender, you make several valid points. I think the noise is minimal in letters to the editor. I read all of the letters, and those wishing to stay anonymous usually signed with initials only, or did not sign their letters at all.
As for intentionally changing their name to the opposite gender, I once heard an interesting story. When I worked at a big company in Manhattan, a colleague (a woman) told me that when she was a graduate student at a college in New York, she emailed her professor with a question. When she emailed as herself, she got brushed off. However, when she sent the exact same question under a male name, he answered it and told her that it was a great question.
So it is entirely possible that some of the people who wrote those letters to the editor changed their names to the opposite gender. However, I think it’s important to consider the goal of letters to the editor. The goal is to make one’s opinion known to the community, so I think the number of such people is minimal. Mostly, people want their opinion known under their real name, or sometimes anonymously. As a wilder possibility, someone might intentionally sign as someone else they know, but my estimate is that the number of such letters is very low.
Location has to be important, too. I can analyze that data, and it would be really interesting to see the differences and the interactions between gender and location.
When there were long quotes in the letters, I discarded them, precisely because that text did not represent the author.
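Discarding long quotes can be sketched roughly as follows. This is only an illustration under assumptions: the 10-word threshold, the function name, and the reliance on double quotation marks to detect quotes are all hypothetical, and block quotes marked up in HTML would need different handling.

```python
import re

# Matches a passage enclosed in straight or curly double quotes.
QUOTE_RE = re.compile(r'"[^"]*"|“[^”]*”')


def drop_long_quotes(text, max_words=10):
    """Remove quoted passages longer than max_words words (assumed threshold)."""

    def replace(match):
        quoted = match.group(0)
        # Short quotes stay; long ones are dropped entirely.
        return quoted if len(quoted.split()) <= max_words else ""

    cleaned = QUOTE_RE.sub(replace, text)
    # Collapse the double spaces left behind where a quote was removed.
    return re.sub(r" {2,}", " ", cleaned).strip()
```

For example, a letter containing a one-word quote and a twelve-word quote would keep the first and lose the second.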
Thanks for the thoughtful comments!

By: Dmitri Zdorov
https://www.practicallinguist.com/lte-scraping-web-sites-collect-data/#comment-1
Fri, 29 Sep 2017 23:20:06 +0000

There are specialized tools for downloading websites that can extract just the plain text.
Have you looked into them?

As for data itself:
It’s interesting. I think some people feel that their gender is treated unfairly, so they often write under pseudonyms or made-up names, sometimes deliberately switching gender. That might affect the data you collect.
Also, age, location, and other cultural factors can have more influence than gender. And people often copy snippets from other authors.
All this adds a lot of noise to the data.

As for the site and comments, try something like disqus.com
