LTE: Letters to the editor corpus analysis using machine learning

See if you can determine whether the author of each of the following texts is male or female.

Text A:

When my children were younger, my goal was to get them into the gifted program or even charter schools, because I just wanted a high-quality option. I was thankful that one of my daughters was accepted to Alexander Graham Bell Elementary School’s gifted program. But my other daughter, who was not as advanced, did not make it in. My oldest child, who attended Alexander Graham Bell, was able to get a great education.

Text B:

Gov. Bruce Rauner rocks the political boat. He has no personal agenda other than decent government. Former governors, of both parties, were all about typical politics: money, power and ego/legacy (see: Barack Obama). Of course they worked well together; they sold their principles to “get what they could,” not what they should. Democrats’ naivete regarding Illinois’ (and Chicago’s) blatant corruption is laughable. Illinois doesn’t deserve our honest and hardworking governor. It likes the stink of the status quo.

Did you pick text A as female-authored and text B as male-authored? That’s what I would have done, going along with societal stereotypes. But of course, I hand-picked texts that are just the opposite: text A was written by a man and text B by a woman.

When deciding who wrote which text, what factors influenced your decision? The topic, probably. Maybe how personal it sounds. In some languages, it is almost trivial to determine the gender of the writer. For example, Spanish has different masculine and feminine adjective endings:

estoy cansada 'I am tired-f'
estoy cansado 'I am tired-m'

Slavic languages also mark gender, with different past-tense verb endings for male and female speakers:

ja wypiłam 'I drank-f'
ja wypiłem 'I drank-m' (Polish)

English does not have anything like this; however, there are clues that can give away the gender of the author. Sometimes there are clear indications, such as self-identifying expressions like “as a father.” Other times we rely on perceived correlations, such as topic, and those correlations can mislead us, as the two texts above show.
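To make the idea of self-identifying expressions concrete, here is a tiny, hypothetical sketch of how such cues could be picked up with regular expressions. The phrase lists are my own illustrations, not a resource from this project.

```python
import re

# Hypothetical cue lists for illustration only; a real system would need a much
# richer inventory of self-identifying phrases.
MALE_CUES = re.compile(r"\bas a (father|husband|grandfather)\b", re.IGNORECASE)
FEMALE_CUES = re.compile(r"\bas a (mother|wife|grandmother)\b", re.IGNORECASE)

def self_identified_gender(text):
    """Return 'male', 'female', or None based on explicit self-identifying cues."""
    if MALE_CUES.search(text):
        return "male"
    if FEMALE_CUES.search(text):
        return "female"
    return None

print(self_identified_gender("As a father of two, I worry about school funding."))  # male
```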

Topic is one of the many questions that arise when we consider differences between female and male writing. Do women and men write about different things? I would argue that yes, they usually do, but I would like data to back that up. To find evidence, I collected a corpus of publicly available letters to the editor. While I started with the idea of looking at differences between male and female writing, the project evolved into something much larger. This is going to be a series of posts about the project.

Here is an outline of the work that I did:

  1. Corpus collection. Where do you get texts where the gender of the author is known? My friend had a brilliant idea: letters to the editor. Usually (but not always) the letters are signed, and most of the names can be labeled as female or male. Also, most letters to the editor are freely available on the Internet. I scraped several sites, collecting between 200 and 1500 letters per site (a download-and-extraction sketch follows this list).
  2. Corpus labeling. I wanted to compare supervised and unsupervised methods for topic identification in my letters to the editor (LTE) corpus. For the supervised topic labeling, I needed to manually tag the corpus. I developed a simple web-app to do that.
  3. Text extraction. The letters were all in HTML, and they all contained other information in addition to the text of the letter itself, so I wrote a program to automatically extract the relevant text (see the sketch after this list).
  4. Author identification. Since I had already tagged my corpus for author and topic, I wanted to see how well a machine learning algorithm would do at automatically extracting the author of the letter from the file, which is not as trivial a task as it may seem at first.
  5. Automatic topic assignment, supervised. I labeled each letter with its topic, and in this stage I used that information to train a classifier to automatically assign a topic given the text of the letter (a classifier sketch follows this list).
  6. Automatic topic assignment, unsupervised. A technique called LDA (latent Dirichlet allocation) lets us group documents according to their similarity to each other without any labels. I used it to group the letters and to compare the resulting clusters to the topics from the supervised topic assignment task (an LDA sketch follows this list).
  7. Expectations and results. What are the theoretical expectations about the data? How do they compare to the results? Which topics are more common than others? What do men and women write about? Are there any differences by location?
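To make steps 1 and 3 more concrete, here is a minimal sketch of what downloading a letters page and extracting the letter text could look like. The URL, the CSS selectors, and the requests/BeautifulSoup combination are assumptions for illustration; every site I scraped needed its own site-specific rules.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical URL and CSS selectors for illustration only.
INDEX_URL = "https://example-newspaper.com/opinion/letters"

def fetch_letter_links(index_url):
    """Collect links to individual letters from a letters-to-the-editor index page."""
    html = requests.get(index_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select("a.letter-link")]  # assumed selector

def extract_letter_text(letter_url):
    """Keep only the letter body, dropping navigation, ads, and other page furniture."""
    html = requests.get(letter_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    body = soup.select_one("div.article-body")  # assumed selector
    return "\n".join(p.get_text(strip=True) for p in body.find_all("p"))
```

In practice the index-page and article-body selectors have to be worked out per site by inspecting its HTML.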
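For the supervised topic assignment (step 5), a standard recipe is a bag-of-words (TF-IDF) representation fed to a linear classifier. The scikit-learn sketch below assumes the labeled letters are available as two parallel Python lists; it illustrates the general approach rather than the exact model used in this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def train_topic_classifier(letters, topics):
    """Train a TF-IDF + logistic regression topic classifier.

    `letters` is a list of letter texts; `topics` is a parallel list of topic labels.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        letters, topics, test_size=0.2, random_state=0, stratify=topics
    )
    model = make_pipeline(
        TfidfVectorizer(stop_words="english"),
        LogisticRegression(max_iter=1000),
    )
    model.fit(X_train, y_train)
    # Held-out precision, recall, and F1 per topic.
    print(classification_report(y_test, model.predict(X_test)))
    return model
```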
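For the unsupervised side (step 6), here is a generic LDA sketch, again in scikit-learn and again an illustration rather than this project's exact code. It discovers a chosen number of topics from unlabeled letters and prints the most probable words for each, which is what gets compared against the manually assigned topics.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def lda_topics(letters, n_topics=10, n_top_words=8):
    """Fit LDA on raw letter texts and print the top words of each discovered topic."""
    vectorizer = CountVectorizer(stop_words="english", max_df=0.9, min_df=2)
    counts = vectorizer.fit_transform(letters)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topics = lda.fit_transform(counts)  # per-letter topic proportions
    words = vectorizer.get_feature_names_out()
    for i, component in enumerate(lda.components_):
        top = [words[j] for j in component.argsort()[::-1][:n_top_words]]
        print(f"topic {i}: {', '.join(top)}")
    return doc_topics
```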

2 thoughts on “LTE: Letters to the editor corpus analysis using machine learning”

    1. Thanks for reading!

      Supervised methods mean that you have “the right answers”, i.e., labels, for your data. For example, if you are trying to train a classifier to determine whether an email is spam or not, you have a big set of actual emails, and each email comes with a label, “spam” or “not spam”. Then you can estimate the error of your classifier, measure its precision and recall, and evaluate it against the given labels.
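      To make the evaluation part concrete, here is a tiny illustration, with made-up labels rather than real spam data, of how precision and recall are computed from gold labels and predictions:

      ```python
      from sklearn.metrics import precision_score, recall_score

      # Made-up gold labels and predictions for a handful of emails (1 = spam, 0 = not spam).
      y_true = [1, 0, 1, 1, 0, 0, 1, 0]
      y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

      print("precision:", precision_score(y_true, y_pred))  # of the emails flagged as spam, how many really are
      print("recall:   ", recall_score(y_true, y_pred))     # of the real spam emails, how many were caught
      ```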

      With unsupervised methods, you don’t have any “right answers”, or labels, for your data. One example is clustering documents by topic when the topics are not known in advance. The algorithm tells you: here are the clusters of documents that look similar to each other (you pick the number of clusters). I would think Twitter does something like this, since they get lots of new data that has no preexisting topic labels.
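      As a rough sketch of that kind of clustering (generic scikit-learn k-means on TF-IDF vectors, not necessarily what Twitter or this project uses), you turn the documents into vectors and let the algorithm assign each one to one of the k clusters you asked for:

      ```python
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.cluster import KMeans

      def cluster_documents(documents, n_clusters=5):
          """Group unlabeled documents into n_clusters clusters of similar texts."""
          vectors = TfidfVectorizer(stop_words="english").fit_transform(documents)
          kmeans = KMeans(n_clusters=n_clusters, random_state=0, n_init=10)
          return kmeans.fit_predict(vectors)  # cluster id for each document
      ```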
