Practical Linguist: Linguistics meets computers

Ensuring data quality for LaborEdge: a case study
Published October 2, 2019

In August I did a data quality project with LaborEdge, automating part of their data standardization process using Natural Language Processing (NLP) techniques. Data standardization is one of a broader set of tasks that ensure data quality. In this post I am going to tell you about the project, explain what data quality is, and show how NLP can help improve it.

Data quality has enormous impact on businesses of all sizes. All businesses have data, ranging from a CRM database to complex business process data, and the cost of low-quality data can be very high.

NLP techniques can help to improve data quality

Many business professionals, when interviewed, say some variation of “I don’t trust my data.” Low-quality data can include inconsistencies, such as different categories assigned to the same product, duplicate records, or incomplete data. Lack of confidence in the data slows businesses down, largely due to the time spent reconciling and verifying data against expectations (Modern Data Strategy, Fleckenstein and Fellows). The business losses stemming from poor data quality can be enormous. In 2013, the 140 businesses surveyed by Halo Business Intelligence estimated an average loss of $8M due to low-quality data. On the other hand, higher data quality leads to business improvements:

  • 10-20% reduction in corporate expenses
  • 40-50% reduction in IT costs
  • Up to 40% reduction in operating costs

One example of how poor data quality can affect the bottom line is duplicate records. In medicine, for instance, this is a very real problem, with an average cost of $50 per duplicate record.

Another aspect of data quality is ensuring that data is normalized according to internal or external standards. Information extracted from outside data sources needs to be standardized before it can serve a uniform purpose. Unstandardized data can lead to several problems:

Standardizing data is very important
  • Incorrect product or service matches, directly affecting the bottom line
  • Errors in reporting that can affect decision making
  • Mistakes in compliance and adherence to business rules

Natural language processing techniques are instrumental in standardizing text data, and this is the data quality aspect with which I helped LaborEdge, a company that makes healthcare staffing software.

LaborEdge bases its system for classifying duty shifts for healthcare professionals on time of day and shift duration. However, the shifts themselves are compiled and entered by staffing agencies, which can mean tremendous variation in the format of the shift data. Depending on the agency, shift data can include any, all, or none of the following: shift start, shift end, number of shifts per week, number of hours in one shift, and usual time of day for the shift, written in either military or standard time. Some examples of input phrases are “0700 – 1730”, “10 Hour Days”, “12hr d”, “7a” and “12H Nights: 7:00 PM – 7:00 AM”.
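To give a feel for what standardizing such phrases involves, here is a minimal sketch of a regex-based normalizer for a few of the formats shown above. The patterns and output fields are invented for illustration; the actual LaborEdge system (described below) put an entity extraction model first and used rules only as a fallback.

```python
import re

# Hypothetical normalizer covering a couple of the shift formats above;
# real agency data is far messier than these two rules cover.
def normalize_shift(phrase):
    p = phrase.lower().strip()
    # "0700 - 1730" style: military start and end times
    m = re.match(r"^(\d{3,4})\s*-\s*(\d{3,4})$", p)
    if m:
        return {"start": m.group(1).zfill(4), "end": m.group(2).zfill(4)}
    # "10 Hour Days", "12hr d" style: duration plus time of day
    m = re.match(r"^(\d{1,2})\s*(?:hour|hr|h)s?\s*(d\w*|n\w*|e\w*)?", p)
    if m:
        tod = {"d": "day", "n": "night", "e": "evening"}.get((m.group(2) or "")[:1])
        return {"hours": int(m.group(1)), "time_of_day": tod}
    return {}  # unparsed: left for the next stage (a model or more rules)

print(normalize_shift("0700 - 1730"))   # {'start': '0700', 'end': '1730'}
print(normalize_shift("10 Hour Days"))  # {'hours': 10, 'time_of_day': 'day'}
print(normalize_shift("7a"))            # {} - a format these two rules miss
```

Note how “7a” already falls through both rules: each new format needs another rule, which is exactly why a pile of regular expressions becomes hard to maintain.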

LaborEdge’s classification system is therefore essential to their job-matching software: it standardizes the ‘shift’ values so the software can match jobs with candidates. Some of the difficulties LaborEdge encountered with a previous classification system were:

  • Inconsistent and unnecessary ‘shift’ classes caused by manual classification
  • Complex regular-expression matching rules: updating one rule could break others, which made debugging difficult
  • The inconvenience of needing a programmer to update the matching rules

To solve this problem, I first took a detailed look at the data to better understand the underlying processes and the existing classes. Then I designed a system that used a combination of NLP approaches involving both machine learning and regular expression matching, with matching rules being a small but robust part built on top of machine learning. Since the number of entities to be parsed out was finite and the entities themselves well-defined, the first stage was training an entity extraction model that identified shift start, shift end, shift duration, time of day, and shifts per week from the phrase. Whatever it didn’t catch or didn’t get right was processed by the regular expression rules.

The result is an automatic system that works without manual intervention and improves data quality by standardizing it. To take the project further, I would use the output of the rule-based component as a labeled dataset for training a machine learning model to replace it, so that even less time is spent modifying rules.

Data quality is of utmost importance for business processes, and improving it can cut IT costs by 40-50%. Standardizing data coming in from external sources is equally important: it allows your system to extract the information of interest and improves reporting and compliance. In the LaborEdge case, I used a combination of NLP approaches to create a robust shift description standardization system that eliminates the need for time-consuming manual updates and produces higher quality data.

If you would like help with data quality and standardization of text data, please email me at zhenya@practicallinguistics.com and we can discuss the details.

How Natural Language Processing (NLP) can reduce costs, improve productivity, and raise profits in business
Published July 24, 2019

Almost all businesses deal with texts: invoices, proposals, research papers, reports, resumes, job descriptions, emails, news, and other documents. Some of these text documents require processing, such as sorting, extracting and entering information into a database, evaluating for sentiment, and so on. According to Forbes Magazine, 84% of businesses still rely on some sort of manual data processing every day, and many of those tasks could be automated with the right technology. In this post I talk about various uses of Natural Language Processing (NLP) in business and how it could help automate manual text processing.

Most companies are in some form or another involved with marketing, accounting, customer service, sales, and hiring, and each of these elements can benefit from Natural Language Processing.

Marketing

In marketing, NLP and other AI solutions are being widely applied, with chatbots probably the most well-known of these technologies. Chatbots are an alternative to email newsletters for delivering marketing messages. When used this way, chatbots can better segment an audience and improve reach to potential customers. They currently offer significantly better click-through rates than newsletters: a range of 15-60%, versus newsletter rates that top out at around 5%.

Chatbots are also an alternative to website forms that potential customers use to reach out to businesses. According to Conversational Marketing, 58% of companies never follow up with website visitors who fill out forms, while 81% of tech buyers don’t bother filling out forms at all. Chatbots, on the other hand, respond much faster, which significantly increases the chances of the response arriving when it’s most effective (within five minutes, according to Conversational Marketing).

Social media and user survey analysis

Using social media and customer review analysis, it is possible to unlock insights present in the data.

Marketing usually also involves analyzing social media and customer review messages. Using NLP technology, businesses can break down social media posts and reviews into parts, sort those parts by topic and, by applying sentiment analysis to them, find out how customers feel about different aspects of the product or service, such as prices, customer service, and quality. The same techniques can be applied to user feedback surveys.

Accounting

Accounting involves lots of data entry, such as processing financial reports, receipts and invoices. Manually entering such data can eat up a lot of employees’ time. Automating these tasks can eliminate these time sinks and save businesses money. For example, Botkeeper, a bookkeeping program that uses AI to handle accounting, automated more than 1.2 million hours for its 1,000 clients. By one estimate, employees who saved 240 hours due to automation return $9,240 in value to their employers.

Customer service

Chatbots can be a timesaver for customer service as well as marketing. They can instantly answer frequently asked questions and redirect more difficult ones to humans. For example, Shell created digital assistants to answer questions about its lubricants business, and the company says the technology has already reduced its call center volume by 40% while meeting users’ expectations 99% of the time.

Here is an example of how a chatbot can answer frequently asked questions from Bloomberg help desk:

19:07:44 USER : hi. how can i completely remove a permanent chat that i created earlier?
19:07:45 BLOOMBERG HELP DESK : Thank you for using Bloomberg HELP!
19:07:49 BLOOMBERG HELP DESK : To delete a chat which you created, from the IB Manager, click the chat name. At the top right of the chat, click the icon with three horizontal lines, then select “Delete”.

Sales

In addition to chatbots, interacting with customers using suggestions, emails, and other messages that are personalized based on their behavior is a sure way to increase sales. Personalized recommendations are the products that Amazon and other online retailers display when customers buy a related item. One such retailer implemented a product recommendation engine that boosted conversion rates by 32% and increased profits by 23%.

HR (Hiring)

Recommendation engines can also be used in hiring by ranking resumes that best match a job description (while being careful to avoid bias). They can reduce the time it takes to go through all the resumes received as applications.

In addition, NLP makes it possible to automatically extract information from resumes, including names, education, experience, and specific skills, and store it in a database for easy retrieval. This data, combined with information about placements, can then be used to analyze which candidates do especially well at certain companies (again, being careful about how this information is used).
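As a rough sketch of the kind of extraction described above, here is a toy resume parser. The field patterns and the skill list are invented for illustration; a production system would use trained entity extraction rather than a handful of regexes.

```python
import re

# Illustrative-only extractor with an invented skill vocabulary;
# not a production resume parser.
SKILLS = {"python", "sql", "excel", "java"}

def extract_resume_fields(text):
    email = re.search(r"[\w.+-]+@[\w-]+\.[A-Za-z]{2,}", text)
    years = re.search(r"(\d+)\+?\s+years", text, re.IGNORECASE)
    skills = sorted(s for s in SKILLS if re.search(rf"\b{s}\b", text, re.IGNORECASE))
    return {
        "email": email.group(0) if email else None,
        "years_experience": int(years.group(1)) if years else None,
        "skills": skills,
    }

resume = "Jane Doe, jane@example.com. 7 years of experience with Python and SQL."
print(extract_resume_fields(resume))
```

The extracted dictionary is exactly the kind of record that can be written to a database for later retrieval and placement analysis.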

Other

Anywhere there is lots of textual data, NLP and AI can improve and reduce manual processing of such texts.

As you can see, NLP is a technology that can potentially reduce costs, increase productivity, and raise profits for a business of any size and any domain.

If you would like help with an NLP project, contact me at zhenya@practicallinguistics.com or reach out on LinkedIn. You can also reach out to me for a text data audit, where we look at your text data and I will show you how NLP techniques can automate some of your manual processes involving that data.

What is NLP?
Published May 2, 2019

I’m often asked what it is that I do. When I mention NLP, the next question is frequently, “What is that?” So here is a blog post with both the short and the long versions of my answer.

Here is the short answer. NLP stands for Natural Language Processing. The “Natural” in Natural Language Processing refers to human language, as opposed to computer languages. “Processing” means automated processing by computer. In short, NLP consists of programs that try to understand or generate pieces of human language, and it often uses AI techniques.

Examples

Machine translation

Here are some simple applications of NLP in our everyday lives:

  • Spellcheckers: programs that check spelling and grammar, identify mistakes and then offer suggestions.
  • Suggested words in text messaging: in order to predict the next word for your message, the program processes lots of text and computes probabilities of the words that come after each word you type.
  • Alexa, Siri and other voice assistants: these use complex NLP techniques to understand what is being said and then perform the requested action.
  • Online search: NLP is used in search in many ways. For example, it helps ensure that “Avengers showtimes” shows up in the results even when we search for “Avengers show time”.
  • Google translate and other machine translation apps: in order to translate from one language to another, the translation tools build neural network models from large collections of text in pairs of languages.
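The suggested-words example above can be sketched as a toy bigram model: count which word follows which in a corpus, then suggest the most frequent follower. The tiny corpus here is invented; real keyboards train on vastly larger text collections with far smarter models.

```python
from collections import Counter, defaultdict

# Toy next-word suggestion via bigram counts, as described above.
corpus = "i love pizza . i love pasta . i hate traffic .".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def suggest(word):
    # Most frequent word observed after `word` in the corpus.
    following = counts.get(word)
    return following.most_common(1)[0][0] if following else None

print(suggest("i"))     # "love" (seen twice, vs "hate" once)
print(suggest("love"))  # "pizza" or "pasta" (a tie in this tiny corpus)
```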

What else can NLP do?

Sentiment analysis of customer reviews

Everywhere there is a large collection of text, NLP can likely be applied to it. There are many ways to do it, and here are a few:

  • Extracting information from documents. Many companies deal with large sets of documents that they use in their daily processes. For example, recruiting companies collect resumes and then extract information from them, such as name, years of experience, education, contact information, etc. Using NLP, it is possible to build a system that will parse the resume, extract the needed information and record it in a database.
  • Extracting features and topics, for example in customer reviews and social media mentions. Customers often talk about different aspects of a product or service they are reviewing in a single review; for example, price, delivery, customer service, quality, durability, etc. With natural language processing it is possible to build a system that will collect reviews about a product or service and extract the different features that are being talked about in the reviews.
  • Analyzing text for positive and negative feelings. There are NLP tools that let us build systems that determine if the text is positive, negative, neutral, or a more nuanced set of emotions. For example, customer review features can be analyzed for positive or negative sentiment to reveal how customers feel about different features of a service or product.
  • Classifying documents. Many times, there is a need to classify a set of documents into different subgroups. For example, companies may rely on news as their source of information, and with NLP it is possible to create a program that will divide the incoming news stream into relevant and irrelevant news pieces, and further divide the relevant ones by topic, such as technology, economy, politics, etc.

As you can see, NLP is usually applied to specific tasks, as opposed to understanding and creating language the way humans do, which remains very hard to achieve.

How does NLP relate to AI?

Artificial Intelligence

AI, or Artificial Intelligence, is a set of techniques in which a computer uses existing data to learn a function, then applies it to future data to produce answers to various problems. AI can be applied to many different domains, such as facial recognition in images, automatic transcription of audio into text, and financial fraud detection. In that sense, AI is used in NLP as in any other domain, but NLP also uses techniques outside of AI. NLP programs can be built with three different approaches:

  • Rules. For a given NLP task we can write a set of rules that produce a result close to the desired one. For example, if we are tasked with separating an incoming news stream into different topics, we can write a list of keywords for each topic and use that as a predictor. While this is the simplest approach, it is very labor intensive, and it is usually hard to cover all the cases that arise.
  • Machine learning. With machine learning, we can devise an algorithm that learns on its own which news texts belong to a particular topic. For the algorithm to work, the incoming items first need to be represented as a set of features; in this case, that could be the list of all the words in the article, the title, bolded words, numbers, proper nouns mentioned, etc. We also need a representative sample of the documents. This is one of the most widely used techniques today; the task of dividing a news stream into topics, for example, can be readily solved with it.
  • Deep learning. When the representative set of documents being processed is large enough (usually much larger than what feature-based machine learning needs), deep learning can be applied. Here a neural network (so named because it loosely mimics the way biological neural processes work) learns the function needed to perform the task. The advantage of deep learning is that the engineers building the system do not need to spend much time devising features to represent the incoming items. However, the amounts of data and computational power required are very large. For example, the most accurate results in machine translation (such as Google Translate) are achieved using deep learning.

These last two techniques are considered part of Artificial Intelligence.
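As a minimal illustration of the rules approach, here is a keyword-list classifier for news headlines. The topic keywords are invented for illustration; note how any headline whose vocabulary falls outside these lists is simply missed, which is exactly the coverage problem mentioned above.

```python
# Sketch of the rules approach: one keyword set per topic (invented lists).
TOPIC_KEYWORDS = {
    "technology": {"software", "ai", "startup", "chip"},
    "economy": {"inflation", "market", "stocks", "trade"},
    "politics": {"election", "senate", "minister", "vote"},
}

def classify(headline):
    words = set(headline.lower().split())
    # Pick the topic whose keyword list overlaps the headline most.
    best = max(TOPIC_KEYWORDS, key=lambda t: len(words & TOPIC_KEYWORDS[t]))
    return best if words & TOPIC_KEYWORDS[best] else "other"

print(classify("Senate schedules vote on trade bill"))  # "politics"
```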

Did this article help you understand the main aspects of Natural Language Processing? Let me know in the comments.

If you are a business with a large collection of text documents and you would like to automate the manual process in place, feel free to contact me at zhenya@practicallinguistics.com.

Analyzing Social Media Posts and Customer Reviews by Topics: an Important Data-driven Marketing Tool that Helps Reveal Trends in Different Business Aspects
Published November 20, 2018

Social media posts and reviews from review sites like Yelp are an important and powerful marketing and feedback tool for businesses. They are readily available on the Internet and provide customers’ opinions on different aspects of restaurants, car mechanics, cinemas, IT consulting firms, mobile apps, Internet hosting companies and every other kind of business out there. The data insights embedded in these customer messages are very informative, but are not readily available without further analysis.

Consider this review of Pomodoro, an Italian restaurant:

Don’t let the pizza parlor storefront or steep, narrow flight of stairs put you off, this place really knows how to do homemade Italian and the price is right! BYOB, not too crowded on a Friday night, great service, and very good food (special attention to the homemade pastas and sauces- pappardelle and black squid linguini were best). All in all, looking forward to returning!! [Emphasis mine]

The review talks about several topics: price, quality of food, and level of service. We can easily parse out those brief evaluations from this one particular review just by looking at it, but imagine trying to sift through hundreds of such descriptions. Discovering the relevant topics and analyzing customer feedback on them manually would take a very long time.

Breaking down reviews by topic and assessing sentiment on them is easily done using NLP tools. The process involves the following steps:

  1. Collect reviews and social media mentions for a business, in this case, the Italian restaurant Pomodoro.
  2. Break reviews into parts (“this place really knows how to do homemade Italian”, “the price is right!”, “great service”, “very good food”, “looking forward to returning!”, etc.). Link the review parts with the complete reviews for more detailed analysis.
  3. Classify review parts into topics (price, service, food quality, etc.).
  4. Run sentiment analysis on each review part to determine whether it’s a positive or negative comment.
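Steps 2-4 above can be sketched in miniature as follows. The topic and sentiment word lists are invented for illustration; a real system would use trained classifiers rather than tiny lexicons.

```python
import re

# Toy version of steps 2-4: split a review into parts, assign each
# part a topic, and score its sentiment (lexicons are invented).
TOPICS = {"price": {"price", "prices", "pricey"},
          "service": {"service", "waiter", "staff"},
          "food": {"food", "pasta", "pizza", "sauce"}}
POSITIVE = {"great", "good", "right", "amazing"}
NEGATIVE = {"bad", "high", "rubbery"}

def analyze(review):
    results = []
    for part in re.split(r"[,.!]+", review):          # step 2: split into parts
        words = set(part.lower().split())
        for topic, kws in TOPICS.items():             # step 3: assign a topic
            if words & kws:
                pos, neg = len(words & POSITIVE), len(words & NEGATIVE)
                sentiment = "positive" if pos >= neg else "negative"  # step 4
                results.append((topic, sentiment, part.strip()))
    return results

print(analyze("Great service, but the price is a bit high."))
```

Each result keeps the original review part, which is what later lets the analysis link a snippet back to its complete review.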

What you get from such an analysis:

  1. Clarity and efficiency: with all relevant review parts grouped by topic, there is no need to search and look through the reviews to see what customers think about each aspect of the business.
  2. Insights over time: if the analysis is run regularly, weekly or monthly, it is easy to see how the customer sentiment changes over time with respect to different aspects of the business. This gives business owners a very useful marketing and customer service tool that will quantify their efforts to improve their business.

Here is an illustration of how the complete review analysis works. For the restaurant above, I collected 76 reviews and ran them through the system. The discovered review topics were: service, staff, food, price, pizza and bread (apparently, bread is important enough that it is being talked about by many customers!). The scores were generally positive:

Customer sentiment by topic for the Pomodoro restaurant

Here are some review snippet examples with corresponding positive or negative sentiment:

Service. Overall sentiment score: 100% positive

  • positive “service was good”
  • positive “service has always been good”
  • positive “the service was great”
  • positive “service was pretty decent!”
  • positive “so service is really good”

Staff. Overall sentiment score: 100% positive

  • positive “and friendly staff”
  • positive “the staff in the kitchen are very nice people”
  • positive “the staff is super friendly”

Food. Overall sentiment score: 84% positive

  • positive “not only is the food amazing and always up to par”
  • negative “2/5 because bad food”
  • positive “the food here is amazing!”
  • negative “the food wasn’t great either.”
  • positive “food is fresh and steaming hot.”

Price. Overall sentiment score: 56% positive

  • positive “if you want quality for a decent price, Pomodoro is a way to go”
  • negative “a price that is a bit high for what I got”
  • positive “fresh homemade pizza for reasonable prices”
  • positive “did my son’s birthday upstairs for a very reasonable price”
  • negative “Although it’s a bit pricey”

Pizza. Overall sentiment score: 89% positive

  • positive “the pizza was great”
  • positive “we got the 16″ abruzzo pizza and it was wonderful!”
  • negative “we stopped bothering getting pizza there because it is a little more expensive than donnagio’s and not much better”
  • positive “their pizza is probably one of the best in the area if you like the more thin crust pizza”

Bread. Overall sentiment score: 100% positive

  • positive “their freshly baked bread was divine!”
  • positive “the only good part was the bread”
  • positive “the bread they bring out for free is amazing!”

Evidently, customers are mostly happy with this restaurant, except for the price. The business can use this bit of insight to justify their higher than average prices by describing, for example, their fresh ingredients in the marketing materials.

For the negative reviews, the analysis links the review parts with the complete reviews so the business owner can figure out the exact problem. For example, one negative comment about food was “2/5 because bad food”. Here is the complete review:

After wanting to test this place out for months I had the chance to do so. The experience wasn’t as good as I expected, in fact it was surprisingly disappointing. The outdoor seating looked great but I decided on the upstairs indoor area just to see what kind of restaurant this is. The decor was alright, nothing nice and nothing bad. The menu options seemed really good and I went with the waiters suggestion on chicken with potato in a cream sauce. The dish came with penne in red sauce as a side (which was better than the main). The main had rubbery chicken that had just about no flavor to it, the sauce was watered down so also lacked in the “cream” part of cream sauce. The pepper and potato wasn’t bad, but I didn’t order a 16$ plate of potatoes and green peppers. The worst and most surprising thing was the vibrating floor. It seems like my table was set over an industrial refrigerator. Trying to eat while my chair and table vibrated like a cellphone was not pleasant. 2/5 because bad food, bad dining experience and a price that is a bit high for what I got. Wood not recommend the food or eating indoors. Maybe try the pizza in the outdoor?

Thus, “rubbery chicken” and lack of cream in the cream sauce were the main problems that produced the bad food score. The business owners can use these insights to address any outstanding problems.

Are customer reviews reliable?

As recently as this month, I saw this post in one of the Facebook groups I belong to:

Offer to write and post fake reviews on behalf of businesses on a Facebook group wall.

Writing dishonest reviews is a business, and filtering them out is a very important step in the review analysis. There are two ways of approaching fake review identification:

  • Analyzing the text for unusual features. For example, comments that are extremely positive or extremely negative are suspicious. Also, the more a review mentions the brand being reviewed, the more likely it is to be a dishonest assessment, since brand mentions can be part of “keyword stuffing,” a search engine optimization strategy.
  • Analyzing the behavior of the users who leave the reviews. Users who write reviews “for hire” tend to write them in bursts, indicating ongoing marketing campaigns. Another characteristic is the similarity of the reviews posted: many reviews written by one fraudster will be semantically very similar, or even exactly the same.

The best fake review filtering systems combine the two approaches, reaching high levels of accuracy in detecting review spam.
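The “similar reviews from one account” signal can be sketched with simple word-overlap (Jaccard) similarity. The 0.6 threshold is an invented illustration; real systems use semantic similarity and combine it with the behavioral and textual signals above.

```python
# Flag review pairs whose word overlap is suspiciously high
# (Jaccard similarity over word sets, invented 0.6 threshold).
def jaccard(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def suspicious_pairs(reviews, threshold=0.6):
    return [(i, j) for i in range(len(reviews))
            for j in range(i + 1, len(reviews))
            if jaccard(reviews[i], reviews[j]) >= threshold]

reviews = [
    "Best pizza in town, amazing staff, five stars",
    "Best pizza in town, amazing service, five stars",
    "The pasta was overcooked and the room was noisy",
]
print(suspicious_pairs(reviews))  # [(0, 1)]
```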

NLP tools let us filter out suspicious comments, split up authentic customer reviews and social mentions by topic and do sentiment analysis on each topic separately, thus producing actionable insights and providing useful marketing tools for business owners.

If you need to have such a review analysis completed, please do not hesitate to contact us at info@practicallinguistics.com.

Consumer Choice is a product and service recommendation site that is scheduled to launch soon. We at Practical Linguistics built the review analysis engine for Consumer Choice.

Photo credits: Depositphotos/Aleksandr Filimonov

The Latest on the Chatbot Craze: What Are They, How They Can Help Your Business and Why Using Them Is Smart
Published July 14, 2018

What are chatbots?

So what are chatbots all about? They are a general category of programs that interact with people via messaging or speech. Alexa, Siri, and similar assistants are speech bots; in this post I will talk about messaging chatbots. Messaging chatbots use your favorite messaging platform: Facebook, WhatsApp, Slack, Telegram, Kik, etc. The program interacts with platform users the way a human would: by sending and receiving messages. It might also have some additional capabilities, like reserving restaurant tables, buying concert tickets, and calling taxis.

Chatbots have been gaining in popularity among all age groups: both millennials and baby boomers use them. In 2017, the number of chatbots Facebook Messenger supports grew to 100K and the global chatbot market is expected to reach $1.23 billion by 2025.

Why are chatbots so popular?

There are several reasons for the chatbots’ growing popularity, with two being most important.

  • A shift in platforms. While phone and email used to be the most popular ways of contacting a business, that is no longer the case: users prefer to send text messages. Getting a text message reminder about an appointment is much more convenient than a phone call: you save time, since reading is quicker than listening, and you have a reminder in your phone that you can always go back to. More and more users expect a business to offer messaging.
  • Spam has not reached our messaging yet, at least not in the epic proportions it has reached our email, and so far companies are determined to keep it that way. Facebook, for example, places strong guidelines on how chatbots can interact with users, and a chatbot can only contact a user if he or she wrote to the bot or subscribed to the page first.

How can chatbots help your business?

What are some ways they can help your business? There are a few. They can automate some of the business’ customer service tasks. For example, a chatbot can answer questions about the hours and location of a hair salon, and even schedule appointments. A chatbot cannot replace live customer service, to be sure, but it can automate some of the more predictable and mundane tasks.

A chatbot can also serve as a great marketing tool: your business can send content to users via messaging and engage them further. It’s a bit like an email newsletter, but a lot better (read on).

Why is using chatbots smart?

  • In customer service, chatbots can reduce costs by handling at least part of the service load, so fewer live agents need to be available.
  • Audience segmentation in marketing. It is much easier to segment your audience into different types via a messaging platform than via an email newsletter, so you can customize messages to different audiences. This customization can be achieved by asking the user directly, as well as by using profile information the platform makes available.
  • People are much more likely to open a message than an email and read through it. Click-through rates (CTR) for messaging bots are between 15 and 60%, while a 5% CTR for an email newsletter is considered quite good.
  • People like talking to bots, even when they know that it’s a computer they are talking to. They will personify them and attribute human characteristics to them (known as the ELIZA effect). In other words, a chatbot is almost like their friend, who, if programmed well, will respond in a friendly and interactive way: by sending videos, emojis and stickers along with text. This is exactly what draws people to them.

How to design a good chatbot

While chatbots are coming to take over the marketing world, it is important to remember that a chatbot is still a computer program and has limitations. Making a computer program converse on the same level as a human is incredibly difficult; the best way to handle this limitation is to set users' expectations.

  • Narrow down the scope and decide upfront what your chatbot will do. Example of good scope: a chatbot to answer questions about hours and locations and to schedule appointments. Example of bad scope: a chatbot that will replace your customer service department.
  • Set the user’s expectations upfront: let the users know that they are receiving messages from a program. The interactive personification (ELIZA) effect still exists, even when the users are well aware they are talking to a bot.
  • Let the user know what the bot can and cannot do and suggest some ways of interacting with it. One of the chatbots I tried out did not understand any of the messages I sent to it; it had a good array of fallback answers (such as "Huh?"), but I could not find a question it could answer with actual information. On the other hand, Slackbot does a great job with this: it suggests that it can remind you of a scheduled task, and even shows the format of the message you should use.
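To make these guidelines concrete, here is a minimal sketch of a narrowly scoped, rule-based bot. The intents, patterns and replies are invented for illustration; a production bot on a messaging platform would be wired up differently, but the scope-narrowing and the informative fallback are the same ideas:

```python
import re

# Narrow scope: hours, location and appointments -- nothing else
INTENTS = [
    (re.compile(r"\b(hours|open|close)\b", re.I),
     "We're open Tue-Sat, 9am-6pm."),
    (re.compile(r"\b(where|location|address)\b", re.I),
     "We're at 12 Main St., next to the bakery."),
    (re.compile(r"\b(appointment|book|schedule)\b", re.I),
     "Sure! What day works for you?"),
]

# Sets expectations upfront: admits it's a bot and suggests what to ask
GREETING = ("Hi, I'm a bot! I can tell you our hours and location, "
            "or book an appointment. Try: 'What are your hours?'")

def reply(message):
    for pattern, answer in INTENTS:
        if pattern.search(message):
            return answer
    # The fallback restates what the bot CAN do instead of a bare "Huh?"
    return "Sorry, I didn't get that. " + GREETING

print(reply("What are your hours?"))
print(reply("Tell me a joke"))
```

Note how every path either answers within the bot's declared scope or steers the user back toward it.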

If the expectations are set well, and the chatbot serves some well-defined goals, it can reduce costs for your business and engage a more targeted audience, connecting with customers in a much more interactive way through their preferred method of communication.

There are many free and paid platforms where you can build a chatbot for your business (this is a great overview). Platforms are a good solution if you would like to build a bot yourself with little or no coding. You might find, however, that a platform meets one set of your needs but not another, and that it is hard to find one that does exactly what you require. Thus, if you want flexible customization for your particular case, it is a better idea to create the bot yourself.

If after reading this post you decide that a chatbot is something you might need, let me know at zhenya@practicallinguistics.com, and we can start a conversation about building one together.

Photo credits: Can Stock Photo / mast3r, Can Stock Photo / Leonidivanov.

]]>
https://www.practicallinguist.com/the-latest-on-the-chatbot-craze-what-are-they-how-they-can-help-your-business-and-why-using-them-is-smart/feed/ 0
LTE: Extracting relevant text and features from HTML https://www.practicallinguist.com/lte-extracting-relevant-text-and-features-from-html/ https://www.practicallinguist.com/lte-extracting-relevant-text-and-features-from-html/#comments Sun, 15 Apr 2018 02:45:25 +0000 http://www.practicallinguist.com/?p=345 LTE: Extracting relevant text and features from HTML Read More »

]]>
If you have lots of HTML files that you collected for a project, chances are you can’t really use those files as is. Usually, you are looking to extract some information from these files, for example, an article (like in my project), a product description, or user reviews. Depending on the data you want to extract, the algorithm will be different, since different sections of the HTML file will be relevant.

In my project, I had HTML files with letters to the editor. Here is an example letter that I scraped and labeled. As you can see, the letter itself takes up less space than the rest of the content: there are links to social media, related articles, other sections on the website and advertisements. We can use BeautifulSoup to convert HTML to text:

from bs4 import BeautifulSoup

with open(FILENAME, 'r') as f:
    html = f.read()
soup = BeautifulSoup(html, "lxml")
text = soup.get_text()

This is the result of this conversion. It includes some spurious JavaScript, which I removed. As you can see, there is lots of text in the conversion that is not the text of the letter; this additional text is mostly links to other pages. The results are clearly unsatisfying, and it would be easiest to cut out unnecessary sections in the HTML file, where we have access to the structure. I explain the logic for my HTML-to-text conversion below. The code is available on GitHub.

First, parse out the html using the lxml package, remove lingering JavaScript and style tags, extract the title and the body:

import lxml.html
from lxml import etree

tree = lxml.html.parse(file)
etree.strip_elements(tree, 'script') #Remove JavaScript
etree.strip_elements(tree, 'style') #Remove style tags
page = tree.getroot() #The root <html> element
title = page.find(".//title").text
body_list = page.cssselect('body')

After that, find the div in the body that contains the text of the letter. Some sites wrap their letter content in <article></article> HTML tags. This is the easiest case, so just return the HTML content inside these tags.

Otherwise, look for the div that contains the most text, but do it in a more sophisticated way than just comparing the lengths of the textual content of all the divs. A div might contain many links that have text, thus having lots of textual content, but it's not the div we want. The one we want has a large text-to-all-content ratio, where all content includes the HTML tags. This way, we find the div with lots of text and few HTML tags:

import re
pattern = re.compile(r'\s+') #Strips whitespace from the text (assumed definition)
b_pattern = re.compile(rb'\s+') #Same, for the serialized bytes
div_tuples = []
for div in divs:
    ratio = len(pattern.sub("", div.text_content()))/len(b_pattern.sub(b"", etree.tostring(div)))
    div_tuples.append((div, ratio))
sorted_divs = sorted(div_tuples, key=lambda d: d[1], reverse=True) #Sort by ratio, highest ratio first

The simplest solution would now be to pick the top div in the sorted_divs list. Sometimes, however, the article pages contain divs with comments, which also have high text-to-all-content ratios. Thus, we take the top n divs from the sorted_divs list and pick the one that contains the most words:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
max_words = 0
for div in candidate_divs:
    num_tokens = len(tokenizer.tokenize(div.text_content()))
    if num_tokens > max_words:
        max_words = num_tokens
        winner = div

This will work in most cases: usually, comments are shorter than the letter itself. However, there are cases where the letter is a short one, and the comments are pretty long. In these cases, the longest comment will be selected as the text of the letter. There are very few such cases, so they won’t affect the results significantly.

At this point we could be done and just use the text of the winning div. It would be nice, though, to save the HTML markup information that could later be used as a feature in a machine learning algorithm. I wrote a small recursive function to achieve this. In writing it, I made the assumption that the tags change only at sentence level. For example, consider this letter. The general structure is as follows: the title of the letter is in an <h1> tag, and each paragraph and the signature is in its own <p> tag. Thus, the tags remain the same for each sentence. Of course, there are cases when a certain word inside a sentence is in bold or italics, and the function can easily be modified to save the same tag information at word level, at some cost in performance. In my case, I didn't need this information, since I only used sentence-level tags in the ML algorithms later on. Here is the function that saves the tag information as a list in a tuple with the sentence:

from bs4 import NavigableString

def tag_text (html, text, cur_tags, ti):
    if (isinstance (html, NavigableString)): #This is a text node
        text.append((str(html), cur_tags[:]))
        return (text, cur_tags)
    for child in html.contents:
        #Track this child's tag only if it is an HTML tag of interest
        tracked = (not isinstance (child, NavigableString)) and child.name in ti
        if tracked:
            cur_tags.append(child.name)
        (text, cur_tags) = tag_text(child, text, cur_tags, ti)
        if tracked:
            cur_tags.pop() #Pop the tag we pushed once its subtree is processed
    return (text, cur_tags)

The next post will be extracting specific information from text using ML techniques. Stay tuned!

]]>
https://www.practicallinguist.com/lte-extracting-relevant-text-and-features-from-html/feed/ 2
LTE: Urban, suburban and rural discourse https://www.practicallinguist.com/lte-urban-suburban-and-rural-discourse/ https://www.practicallinguist.com/lte-urban-suburban-and-rural-discourse/#comments Mon, 27 Nov 2017 02:23:43 +0000 http://www.practicallinguist.com/?p=276 LTE: Urban, suburban and rural discourse Read More »

]]>
After I labeled all the data by topic, I took a quick look at the topic tallies, both overall and by newspaper, keeping in mind the newspapers’ locations. The overall percentages are shown in the graph below:

Some of these stats were expected, others surprising. I expected most of the letters to be about politics and government, and that is indeed the case. Overall, we can identify four topics that account for the majority of the letters: politics, government, community and society. Additionally, healthcare, business and education are also important. Several topics are rare, each accounting for less than 5% of all letters (e.g., technology, sports and science).

An interesting finding is about the third most dominant topic. If you remember from the scraping web-sites post, the newspapers are from three different types of communities: urban, suburban and rural. I originally classified them as follows: urban (Chicago Tribune, Daily Herald), suburban (The Citizen, Times Union), and rural (Dubois County Press, Enquirer Democrat). However, after I looked at the topics data, I reconsidered my original classification.

Here are some data about the towns and counties where the newspapers are from.

Newspaper name           | Newspaper location    | Miles from closest urban area | Town population | Population density per square mile
Chicago Tribune          | Chicago, IL           | 0                             | 10 million      | 11,868 (town)
Times Union              | Albany, NY            | 0                             | 1 million       | 563 (town)
Daily Herald             | Arlington Heights, IL | 27                            | 75,101          | 7,633 (town)
The Citizen              | Fayetteville, GA      | 22                            | 15,945          | 463 (county)
Dubois County Free Press | Jasper, IN            | 80                            | 15,038          | 98 (county)
Enquirer Democrat        | Carlinville, IL       | 60                            | 5,917           | 55.4 (county)

It is clear from the population density data above why I had originally classified the Daily Herald as an urban newspaper and the Times Union as a suburban one. The population density in Arlington Heights is much closer to the population density of Chicago, and the population density of Albany is much closer to the population density of Fayetteville, GA, which is a suburb of Atlanta. However, Arlington Heights is 22 miles from the center of Chicago, and the Daily Herald calls itself "Suburban Chicago's Information Source". Similarly, while the population density of Albany is not very high, it is nonetheless an urban center.

Additionally, I double checked that Dubois County Press and Enquirer Democrat were indeed from rural areas. According to the US Department of Agriculture classification, Dubois county in Indiana is “Urban population of 20,000 or more, not adjacent to a metro area”. Since it is not inside a metro area, and the population is low, we can consider it rural, or much closer to a rural than a suburban or an urban area. Macoupin county in Illinois, where the Enquirer Democrat is from, is classified as “Counties in metro areas of 1 million population or more”, most likely because of its proximity to St. Louis, MO. However, this county is again much closer to a rural area than a suburban one, since its population density is low. Also, the neighboring Montgomery county is classified as “Nonmetro – Urban population of 2,500 to 19,999, adjacent to a metro area”. Clearly, the towns where the newspapers are located, Jasper, IN and Carlinville, IL, are just that, towns, but they are of a very different character than the other four towns/cities we are considering, and the surrounding areas they serve are more rural than the other four areas.

The graph below shows the topic percentages in each of the 6 newspapers. Pay close attention to the distribution of the top three topics and see if you can find a pattern (click to enlarge).

Excluding the two most dominant topics, politics and government, take a look at the community topic. In the urban newspapers, it takes a backseat to society and education, while in the rural newspapers it is either the third most common topic (Dubois County), or the first (Enquirer Democrat). The suburban newspapers follow the same pattern, where the community topic is the third most dominant one after government and politics.

Here is an example of a letter that was tagged with the topic “community”:

A bumpy thrill ride is here on Dunton
It’s Spring once again in Arlington Heights and vacation season is upon us. I encourage all readers of the Daily Herald to visit “Craters of the Moon State Park” on the 300 block of South Dunton Avenue, just south of the downtown business district. Hiking trails into the craters should be opening soon.

The data shows that letters about local communities are more prominent in suburban and rural areas. While this is a preliminary analysis with just 6 data points, we can still speculate as to why that is. The most obvious reason is that people are more concerned with local issues in suburbs and rural areas, while people in cities have less of a sense of community. Another possible reason is letter selection. It is possible that city newspapers do not select letters on local issues as often as suburban and rural ones do, or that they apply more selection overall. This would again reflect reader interest in community topics, depending on the area type. In any case, this is an interesting issue worth more investigation with more data points (and lots more tagging!). It would also be instructive to see if there is a continuum from urban to suburban to rural areas, with community interest being lowest in cities and highest in rural areas.

The next post will be about preprocessing the collected data for machine learning. Stay tuned!

]]>
https://www.practicallinguist.com/lte-urban-suburban-and-rural-discourse/feed/ 6
LTE: Labeling data for machine learning https://www.practicallinguist.com/lte-labeling-data-for-machine-learning/ https://www.practicallinguist.com/lte-labeling-data-for-machine-learning/#comments Fri, 17 Nov 2017 13:35:05 +0000 http://www.practicallinguist.com/?p=256 LTE: Labeling data for machine learning Read More »

]]>
Label data... not people

For the project of automatically assigning topics to the letters to the editor, I needed labeled data. Sometimes blog posts or newspaper articles have assigned labels (for example, this post is tagged with "machine learning" and "natural language processing"). However, none of the newspapers I got my data from did that. Thus, I needed someone to go through the letters and label them.

Before tagging the letters, I needed to decide on a list of topics. I went through a small sample of letters, and came up with a pretty comprehensive list that reflected subjects of most current events:

Politics
Healthcare
Society
Media
Community
Education
Government
Environment
Legislation
Technology
Economy
Crime
Religion
History
Military
Science
Sports
Business

Up until I started labeling, my idea was that each letter would have one main assigned topic. However, I quickly discovered that this might turn out to be problematic, as this letter illustrates:

Very soon, Congress will be looking at a new version of the American Health Care Act, the “replace” portion of the push to replace the ACA. The so-called “MacArthur amendment” is designed to appease some of the Freedom Caucus by allowing states to waive the community rating provision. This means that while insurers will still have to technically offer insurance to those with pre-existing conditions, they can charge significantly higher rates for them. What this will do in practice is to give the illusion of covering pre- existing conditions, but will price people completely out of the market by charging premiums so high that no one could afford them. Some suggest high risk pools could cover these people, but these pools have proved to be a loser in all 33 states where they existed, losing a total of $2 billion the year before the ACA eliminated them. Tell Reps. Hultgren and Roskam that those with pre-existing conditions must retain the benefits of the ACA.

Which topic would you assign to this letter? Politics, healthcare, legislation? Clearly, all these topics are important in the letter. There were many letters that could fit into more than one category. I allowed for this by tagging each letter with a main topic, and several secondary ones. Thus, we might assign topics such as “Legislation”, “Healthcare” and even “Economy” to the letter above. The best topic for it would probably be “Healthcare reform”, a transient category that exists only for a few weeks when it is relevant. While often we cannot predefine such topics, we can explore them if we use the letters without labels with an unsupervised machine learning algorithm, such as clustering. In this case, topics would have to be discovered from the data.

Also, frequently, the letter would not neatly fit into any of the categories. For example, this letter talks about antibiotics in the food industry, where the author of the letter summarizes his reasons for why antibiotics and animal crowding are not problematic. While it is indirectly related to healthcare, it does not really talk about it. The label I thought was fitting was “Business”. Some new labels would probably have to be created specifically for this letter, such as “Food”.

Now that I had predefined topics and letter data, I had to set up the mechanics of the process. I created a simple web-app that could be used to assign topics and copy the author and title. The app included a frame with the letter and another frame with the labeling options: text fields for the author and title, a drop-down menu for the main topic and checkboxes for the secondary topics. Once a letter was submitted, a new one would be loaded, until all letters were labeled. Here is a screenshot of the app:

There were around 1600 letters that needed labeling. One option I could have used is a crowdsourcing web-site. I did take a look at some of them: Amazon Mechanical Turk, Fiverr, Onespace and Freelancer. However, I decided against them, since it would add significant time overhead, including quality control of the completed work. I labeled all of the 1600 letters myself, which was a very educational experience. Here are some of my takeaways about labeling machine learning data:

  • The taggers should be able to add new labels. In case there are several people working on the task, they should coordinate the naming of the new categories.
  • Ideally, a team of taggers would label the letters, where the final label would be assigned by consensus between taggers.
  • Any list of topics may be limiting in a task where the items to be labeled are about current or evolving events. Depending on the application, unsupervised methods that cluster the items together are worth considering.
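For illustration, the mechanics of the labeling loop can be sketched without a web framework. This command-line variant (with a shortened, hypothetical topic list) captures the same flow the web-app implemented: show a letter, record a main topic, move on to the next one:

```python
TOPICS = ["Politics", "Healthcare", "Society", "Community", "Education"]

def label_letters(letters, read_choice=input):
    """Show each letter, record the main topic chosen by number."""
    labels = {}
    for i, letter in enumerate(letters):
        print("--- Letter %d ---\n%s\n" % (i + 1, letter))
        for n, topic in enumerate(TOPICS, 1):
            print("%d. %s" % (n, topic))
        choice = int(read_choice("Main topic number: "))
        labels[i] = TOPICS[choice - 1]
    return labels

# Non-interactive demo: feed the choices programmatically
answers = iter(["2", "4"])
labels = label_letters(
    ["A letter about insurance premiums...", "A letter about our library..."],
    read_choice=lambda prompt: next(answers))
print(labels)
```

The web-app added secondary-topic checkboxes and author/title fields on top of this basic loop, but the bookkeeping is the same.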

After I labeled all the data, I took a quick look at the topic statistics. Some interesting patterns emerged; read about it in the next post.

]]>
https://www.practicallinguist.com/lte-labeling-data-for-machine-learning/feed/ 2
LTE: Scraping web-sites to collect data https://www.practicallinguist.com/lte-scraping-web-sites-collect-data/ https://www.practicallinguist.com/lte-scraping-web-sites-collect-data/#comments Wed, 27 Sep 2017 09:38:01 +0000 http://www.practicallinguist.com/?p=174 LTE: Scraping web-sites to collect data Read More »

]]>
In this post, I detail how I collected the data for the letters to the editor corpus analysis project.

First, I picked several web-sites where I could access letters to the editor archives:

While I was able to collect data from all of these web-sites, it later turned out that Ellsworth American and Mount Desert Islander limit access to 3 articles per month, so I had to discard their data. In the end it was a nice mix of urban (Chicago Tribune, Daily Herald), suburban (The Citizen, Times Union) and what seems like rural (Dubois County Press, Enquirer Democrat) media.

Each of the newspapers above has a page dedicated to letters to the editor. The structure of most of the pages is very similar and resembles a blog. The title page, usually with a URL like site/letters-to-the-editor/, contains about 10 or 20 post snippets with links to the full letters in reverse chronological order. It also links to other title pages that contain links to older letters to the editor, usually with a URL like site/letters-to-the-editor/page#, where again there are about 10 or 20 post snippets. Thus, scraping these letters would be a relatively easy task. We would need to find all the links to the letters, follow them, and save the corresponding pages. Then, we would need to go to the next page, and repeat the process.
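The link-finding step can be sketched with the standard library alone. The URL pattern below is hypothetical; each site's real format differed slightly, as discussed later:

```python
from html.parser import HTMLParser
import re

# Hypothetical letter-URL shape: /letters-to-the-editor/<year>/<title>.html
LETTER_URL = re.compile(r"/letters-to-the-editor/\d{4}/[\w-]+\.html$")

class LetterLinkExtractor(HTMLParser):
    """Collect hrefs that look like links to individual letters."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if LETTER_URL.search(href):
                self.links.append(href)

parser = LetterLinkExtractor()
parser.feed('<a href="/letters-to-the-editor/2017/my-letter.html">A letter</a>'
            '<a href="/sports/game-recap.html">Not a letter</a>')
print(parser.links)
```

A scraper would feed each title page through an extractor like this, download the matched links, and then move on to site/letters-to-the-editor/page2 and so on.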

When I was just starting the project, I wanted to get the data as quickly as possible, and so I tried a Firefox extension, ScrapBook, to automatically download data for two of the web-sites, Chicago Tribune and Times Union. It worked relatively well. However, the only available filter option was to specify the link level, where the first-level links pointed to the newest posts and the second-level links to the older ones. It was also possible to restrict the domain to that site only, which eliminated advertisements and links to other web-sites. However, even if the domain was restricted, the pages still contained all sorts of links that led to unnecessary downloads. Once I received the content, I had to dig through the files to filter out the actual letters.

For the other sites I wrote a simple scraper using the scrapy package. I collected a URL list of the title pages, such as [site/letters-to-the-editor, site/letters-to-the-editor/page2, site/letters-to-the-editor/page3, etc.]. I then specified the format of the letter links so that the scraper could save them. Usually the letter URLs were something like site/letters-to-the-editor/date/title.html, so it was again easy to specify which links to follow. It was not 100% clean, so sometimes I would get files on the same site that were not letters to the editor but regular articles; however, the number of these oddball articles was much lower than with ScrapBook.

Using my own crawler instead of ScrapBook taught me a lesson about throttling the crawling speed. ScrapBook has an automatic sleep period between requests. When I wrote my own scraper, I learned the hard way that I needed to specify a long enough sleep time between requests. When I was scraping the Dubois County press web-site, I managed to bring it down with my downloads for a period of time, since initially I hadn’t set the sleep time at all, for which I apologize profusely.
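A simple way to enforce such a delay is a small throttle object that sleeps between requests. This is a generic sketch, not the ScrapBook or scrapy mechanism (scrapy exposes a DOWNLOAD_DELAY setting for the same purpose):

```python
import time

class Throttle:
    """Enforce a minimum delay between successive requests."""
    def __init__(self, delay):
        self.delay = delay
        self.last = None  # time of the previous request

    def wait(self):
        now = time.monotonic()
        if self.last is not None:
            remaining = self.delay - (now - self.last)
            if remaining > 0:
                time.sleep(remaining)  # be polite to the server
        self.last = time.monotonic()

# Usage: call wait() before every download
throttle = Throttle(delay=0.5)
for url in ["https://example.com/page1", "https://example.com/page2"]:
    throttle.wait()
    # fetch(url) would go here
```

Even a delay of a second or two between requests is usually enough to keep a small local newspaper's server happy.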

I needed to specify the URL format for each site separately, as they differed slightly (although a couple of them had the same formatting, which was convenient). If I had spent a little more time on this, I could probably have included a module to automatically determine the URL structure. However, I had only a handful of sites, and I wanted to get on with the more interesting parts of the project, such as author and topic identification. But before I could get to that, I had to label all of the data. Read about the labeling process in the next post.

]]>
https://www.practicallinguist.com/lte-scraping-web-sites-collect-data/feed/ 2
LTE: Letters to the editor corpus analysis using machine learning https://www.practicallinguist.com/lte-letters-to-the-editor-corpus-analysis-using-machine-learning/ https://www.practicallinguist.com/lte-letters-to-the-editor-corpus-analysis-using-machine-learning/#comments Tue, 26 Sep 2017 15:38:43 +0000 http://www.practicallinguist.com/?p=186 LTE: Letters to the editor corpus analysis using machine learning Read More »

]]>
See if you can determine whether the author of the following texts is male or female.

Text A:

When my children were younger, my goal was to get them into the gifted program or even charter schools, because I just wanted a high-quality option. I was thankful that one of my daughters was accepted to Alexander Graham Bell Elementary School’s gifted program. But my other daughter, who was not as advanced, did not make it in. My oldest child, who attended Alexander Graham Bell, was able to get a great education.

Text B:

Gov. Bruce Rauner rocks the political boat. He has no personal agenda other than decent government. Former governors, of both parties, were all about typical politics: money, power and ego/legacy (see: Barack Obama). Of course they worked well together; they sold their principles to “get what they could,” not what they should. Democrats’ naivete regarding Illinois’ (and Chicago’s) blatant corruption is laughable. Illinois doesn’t deserve our honest and hardworking governor. It likes the stink of the status quo.

Did you pick text A to be female-authored and text B male-authored? That’s what I would have done, going along with societal stereotypes. But of course, I hand-picked texts that are just the opposite. A is written by a man, B, by a woman.

When deciding who wrote which text, which factors influenced your decision? The topic, probably. Maybe how personal it is. In some languages, it is almost trivial to determine the gender of the writer. For example, Spanish has different male and female adjective endings:

estoy cansada 'I am tired-f'
estoy cansado 'I am tired-m'

Slavic languages also make a gender distinction, with different verb endings in the past tense for male and female forms:

ja wypiłam 'I drank-f'
ja wypiłem 'I drank-m' (Polish)

English does not have anything like this; however, there are clues that can give away the gender of the author. Sometimes, there are clear indications, such as self-identifying expressions, for example, as a father. Other times, we can use our judgement based on perceived correlations, for example, with topic, which can sometimes be wrong, as the two texts above show.
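Self-identifying expressions can be caught with simple pattern matching. Here is a minimal sketch; the cue lists are tiny and made up for illustration, and a real system would need far more than this:

```python
import re

# Tiny, illustrative cue lists; real ones would be much longer
FEMALE_CUES = re.compile(r"\bas a (mother|wife|grandmother)\b", re.I)
MALE_CUES = re.compile(r"\bas a (father|husband|grandfather)\b", re.I)

def gender_cue(text):
    """Return the author gender suggested by a self-identifying phrase."""
    if FEMALE_CUES.search(text):
        return "female"
    if MALE_CUES.search(text):
        return "male"
    return "unknown"

print(gender_cue("As a father of three, I oppose this plan."))
print(gender_cue("The schools in our district need funding."))
```

Most letters, of course, contain no such explicit cue, which is why indirect signals like topic become interesting in the first place.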

Topic is one of the many questions that arise when we consider differences in female and male writing. Do women and men write about different things? I would argue that yes, they usually do; but I would like to have data to back that up. To find evidence, I collected a corpus of publicly available letters to the editor. While I started with the idea of looking at differences between male and female writing, it evolved into something much larger. This is going to be a series of posts about this project.

Here is an outline of the work that I did:

  1. Corpus collection. Where to get texts with the gender of the author known? My friend had a brilliant idea: letters to the editor. Usually (but not always), the letters are signed, and most of the names can be labeled as female or male. Also, most letters to the editor are freely available on the Internet. I scraped several sites to collect between 200 and 1500 letters per site.
  2. Corpus labeling. I wanted to compare supervised and unsupervised methods for topic identification in my letters to the editor (LTE) corpus. For the supervised topic labeling, I needed to manually tag the corpus. I developed a simple web-app to do that.
  3. Text extraction. The letters were all in the HTML format, and they all contained other information in addition to the text of the letter itself, so I wrote a program to automatically extract the relevant text.
  4. Author identification. Since I had already tagged my corpus for author and topic, I wanted to see how well a machine learning algorithm would do in automatically extracting the author of the letter from the file, which is not as trivial a task as it may seem at first.
  5. Automatic topic assignment, supervised. I labeled each of the letters with their topics, and in this stage I used that information to train a classifier to automatically assign the topic given the text of the letter.
  6. Automatic topic assignment, unsupervised. A technique called LDA clustering allows us to group the documents according to their similarity to each other. I used it to group the letters and to compare the resulting clusters to the topics from the supervised topic assignment task.
  7. Expectations and results. What are the theoretical expectations about the data? How do they compare to the results? Which topics are more common than others? What do men and women write about? Are there any differences by location?
]]>
https://www.practicallinguist.com/lte-letters-to-the-editor-corpus-analysis-using-machine-learning/feed/ 2