LTE: Labeling data for machine learning

Label data... not people

For the project of automatically assigning topics to the letters to the editor, I needed labeled data. Sometimes blog posts or newspaper articles come with assigned labels (for example, this post is tagged with “machine learning” and “natural language processing”). However, none of the newspapers I got my data from did that, so I needed someone to go through the letters and label them.

Before tagging the letters, I needed to decide on a list of topics. I went through a small sample of letters and came up with a fairly comprehensive list that covered the subjects of most current events:

Politics
Healthcare
Society
Media
Community
Education
Government
Environment
Legislation
Technology
Economy
Crime
Religion
History
Military
Science
Sports
Business

Until I started labeling, my plan was that each letter would have one main assigned topic. However, I quickly discovered that this might turn out to be problematic, as this letter illustrates:

Very soon, Congress will be looking at a new version of the American Health Care Act, the “replace” portion of the push to replace the ACA. The so-called “MacArthur amendment” is designed to appease some of the Freedom Caucus by allowing states to waive the community rating provision. This means that while insurers will still have to technically offer insurance to those with pre-existing conditions, they can charge significantly higher rates for them. What this will do in practice is to give the illusion of covering pre-existing conditions, but will price people completely out of the market by charging premiums so high that no one could afford them. Some suggest high risk pools could cover these people, but these pools have proved to be a loser in all 33 states where they existed, losing a total of $2 billion the year before the ACA eliminated them. Tell Reps. Hultgren and Roskam that those with pre-existing conditions must retain the benefits of the ACA.

Which topic would you assign to this letter? Politics, healthcare, legislation? Clearly, all of these topics matter in the letter, and there were many letters that could fit into more than one category. I allowed for this by tagging each letter with a main topic and several secondary ones. Thus, we might assign topics such as “Legislation”, “Healthcare” and even “Economy” to the letter above. The best topic for it would probably be “Healthcare reform”, a transient category that exists only for the few weeks while it is relevant. While we often cannot predefine such topics, we can discover them by running an unsupervised machine learning algorithm, such as clustering, over the unlabeled letters. In that case, the topics have to be discovered from the data.
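
To make the idea concrete, here is a minimal sketch of discovering topics from unlabeled letters with LDA (latent Dirichlet allocation), one such unsupervised technique and the one I come back to later in this series. The toy letters and the number of topics below are purely illustrative; the real corpus would of course be much larger.

```python
# A minimal, illustrative sketch of unsupervised topic discovery with LDA.
# The toy letters below stand in for the real, unlabeled corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

letters = [
    "Congress must keep protections for pre-existing conditions in the health care bill.",
    "Premiums and high risk pools will price sick patients out of the insurance market.",
    "Our schools need smaller classes and better funding for teachers.",
    "The school board should invest in the gifted program and new textbooks.",
]

# Bag-of-words counts over the letters.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(letters)

# Fit LDA with a guessed number of topics (a real corpus would use more).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Print the most probable words for each discovered topic.
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-6:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
```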

Also, a letter would frequently not fit neatly into any of the categories. For example, this letter talks about antibiotics in the food industry: the author summarizes his reasons why antibiotics and animal crowding are not problematic. While the subject is indirectly related to healthcare, the letter does not really discuss it, and the label I settled on was “Business”. New labels, such as “Food”, would probably have to be created specifically for letters like this one.

Now that I had the predefined topics and the letter data, I had to set up the mechanics of the process. I created a simple web-app that could be used to assign topics and copy over the author and title. The app included one frame with the letter and another frame with the labeling options: text fields for the author and title, a drop-down menu for the main topic, and checkboxes for the secondary topics. Once a letter was submitted, a new one would be loaded, until all the letters were labeled. Here is a screenshot of the app:
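
For readers who want to build something similar, here is a minimal sketch of what such a labeling app might look like. Flask, the route, and the field names are illustrative choices only; the post does not describe how the original app was built.

```python
# A toy, single-file labeling app in the spirit of the one described above.
# Flask, the route, and the field names are assumptions for illustration.
from flask import Flask, request, render_template_string

app = Flask(__name__)

TOPICS = ["Politics", "Healthcare", "Society", "Media", "Education", "Government"]
letters = ["<text of letter 1>", "<text of letter 2>"]  # would come from the scraped files
labels = []                                             # collected annotations

FORM = """
<div>{{ letter }}</div>
<form method="post">
  Author: <input name="author"> Title: <input name="title"><br>
  Main topic:
  <select name="main">{% for t in topics %}<option>{{ t }}</option>{% endfor %}</select><br>
  Secondary topics:
  {% for t in topics %}
    <label><input type="checkbox" name="secondary" value="{{ t }}"> {{ t }}</label>
  {% endfor %}
  <br><button type="submit">Save and load next letter</button>
</form>
"""

@app.route("/", methods=["GET", "POST"])
def label_letter():
    # On submit, store the annotation for the current letter.
    if request.method == "POST":
        labels.append({
            "author": request.form["author"],
            "title": request.form["title"],
            "main": request.form["main"],
            "secondary": request.form.getlist("secondary"),
        })
    # Serve the next unlabeled letter, or stop when everything is done.
    if len(labels) >= len(letters):
        return "All letters labeled."
    return render_template_string(FORM, letter=letters[len(labels)], topics=TOPICS)

if __name__ == "__main__":
    app.run(debug=True)
```
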

There were around 1600 letters that needed labeling. One option I could have used is a crowdsourcing website; I did take a look at a few of them: Amazon Mechanical Turk, Fiverr, Onespace and Freelancer. However, I decided against them, since they would add significant time overhead, including quality control of the completed work. I labeled all 1600 letters myself, which was a very educational experience. Here are some of my takeaways about labeling machine learning data:

  • The taggers should be able to add new labels. If several people are working on the task, they should coordinate the naming of the new categories.
  • Ideally, a team of taggers would label the letters, with the final label assigned by consensus among the taggers.
  • Any list of topics may be limiting in a task where the items to be labeled are about current or evolving events. Depending on the application, unsupervised methods that cluster the items together are worth considering.

After I labeled all the data, I took a quick look at the topic statistics. Some interesting patterns emerged; read about them in the next post.
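
For reference, getting those counts takes only a couple of lines once the annotations are collected. The label records below are made-up examples in the same shape as the ones produced by the labeling app sketched above.

```python
# Count how often each main topic was assigned (the values here are made up).
from collections import Counter

labels = [
    {"main": "Healthcare", "secondary": ["Legislation", "Economy"]},
    {"main": "Politics", "secondary": ["Government"]},
    {"main": "Healthcare", "secondary": []},
]

main_counts = Counter(item["main"] for item in labels)
for topic, count in main_counts.most_common():
    print(f"{topic:12} {count}")
```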

LTE: Letters to the editor corpus analysis using machine learning

See if you can determine whether the author of each of the following texts is male or female.

Text A:

When my children were younger, my goal was to get them into the gifted program or even charter schools, because I just wanted a high-quality option. I was thankful that one of my daughters was accepted to Alexander Graham Bell Elementary School’s gifted program. But my other daughter, who was not as advanced, did not make it in. My oldest child, who attended Alexander Graham Bell, was able to get a great education.

Text B:

Gov. Bruce Rauner rocks the political boat. He has no personal agenda other than decent government. Former governors, of both parties, were all about typical politics: money, power and ego/legacy (see: Barack Obama). Of course they worked well together; they sold their principles to “get what they could,” not what they should. Democrats’ naivete regarding Illinois’ (and Chicago’s) blatant corruption is laughable. Illinois doesn’t deserve our honest and hardworking governor. It likes the stink of the status quo.

Did you pick text A to be female-authored and text B male-authored? That’s what I would have done, going along with societal stereotypes. But of course, I hand-picked texts that are just the opposite. A is written by a man, B, by a woman.

When deciding who wrote which text, which factors influenced your decision? The topic, probably. Maybe how personal the text is. In some languages, it is almost trivial to determine the gender of the writer. For example, Spanish has different masculine and feminine adjective endings:

estoy cansada 'I am tired-f'
estoy cansado 'I am tired-m'

Slavic languages also mark a gender distinction, with different past-tense verb endings for male and female forms:

ja wypiłam 'I drank-f'
ja wypiłem 'I drank-m' (Polish)

English does not have anything like this; however, there are clues that can give away the gender of the author. Sometimes there are clear indications, such as self-identifying expressions, for example, “as a father”. At other times, we rely on judgement based on perceived correlations, for example the topic, which, as the two texts above show, can sometimes lead us astray.
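
When such explicit clues are present, even a crude pattern search can catch them. The phrase lists below are only illustrative, not a validated lexicon.

```python
# Look for explicit self-identifying phrases; the word lists are illustrative only.
import re

FEMALE_CLUES = re.compile(r"\bas a (mother|wife|grandmother|woman)\b", re.IGNORECASE)
MALE_CLUES = re.compile(r"\bas a (father|husband|grandfather|man)\b", re.IGNORECASE)

def explicit_gender_clue(text):
    """Return 'female', 'male', or None based on self-identifying expressions."""
    if FEMALE_CLUES.search(text):
        return "female"
    if MALE_CLUES.search(text):
        return "male"
    return None

print(explicit_gender_clue("As a father of three, I worry about school funding."))  # -> male
```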

Topic is one of many angles to consider when we look at differences between female and male writing. Do women and men write about different things? I would argue that yes, they usually do, but I would like to have data to back that up. To find evidence, I collected a corpus of publicly available letters to the editor. While I started with the idea of looking at differences between male and female writing, the project evolved into something much larger. This is going to be a series of posts about this project.

Here is an outline of the work that I did:

  1. Corpus collection. Where to get texts with the gender of the author known? My friend had a brilliant idea: letters to the editor. Usually (but not always), the letters are signed, and most of the names can be labeled as female or male. Also, most letters to the editor are freely available on the Internet. I scraped several sites to collect between 200 and 1500 letters per site.
  2. Corpus labeling. I wanted to compare supervised and unsupervised methods for topic identification in my letters to the editor (LTE) corpus. For the supervised topic labeling, I needed to manually tag the corpus. I developed a simple web-app to do that.
  3. Text extraction. The letters were all in HTML format, and each file contained other information in addition to the text of the letter itself, so I wrote a program to automatically extract the relevant text.
  4. Author identification. Since I had already tagged my corpus for author and topic, I wanted to see how well a machine learning algorithm would do at automatically extracting the author of the letter from the file, which is not as trivial a task as it may seem at first.
  5. Automatic topic assignment, supervised. I had labeled each of the letters with its topics, and in this stage I used that information to train a classifier to automatically assign a topic given the text of the letter (see the sketch after this list).
  6. Automatic topic assignment, unsupervised. A technique called LDA (latent Dirichlet allocation) allows us to group documents according to their similarity to each other. I used it to cluster the letters and compared the resulting clusters to the topics from the supervised topic assignment task.
  7. Expectations and results. What are the theoretical expectations about the data? How do they compare to the results? Which topics are more common than others? What do men and women write about? Are there any differences by location?
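
To make step 5 concrete, here is a minimal sketch of training a topic classifier. TF-IDF features with logistic regression are just one reasonable choice, not necessarily the exact setup used in this project, and the toy letters below stand in for the real labeled corpus.

```python
# A minimal sketch of supervised topic assignment (step 5). TF-IDF plus logistic
# regression is one reasonable choice, not necessarily the setup used here;
# the toy letters below stand in for the real labeled corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

texts = [
    "Congress will vote on the health care bill next week.",
    "The school board approved a new curriculum for our district.",
    "Premiums for people with pre-existing conditions keep rising.",
    "Teachers deserve better pay and smaller class sizes.",
]
topics = ["Healthcare", "Education", "Healthcare", "Education"]

X_train, X_test, y_train, y_test = train_test_split(
    texts, topics, test_size=0.5, stratify=topics, random_state=0
)

model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

print(model.score(X_test, y_test))  # accuracy on the held-out letters
print(model.predict(["The new bill would change insurance coverage rules."]))
```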