Ensuring data quality for LaborEdge: a case study

In August I worked on a data quality project with LaborEdge, automating part of their data standardization process using Natural Language Processing (NLP) techniques. Data standardization is one of a broader set of tasks that ensure data quality. In this post I will tell you about the project, about data quality in general, and about how NLP can help improve it.

Data quality has an enormous impact on businesses of all sizes. Every business has data, ranging from a CRM database to complex business process data, and the cost of low-quality data can be very high.

NLP techniques can help to improve data quality

Many business professionals, when interviewed, say some variation of “I don’t trust my data.” Low-quality data can include inconsistencies, such as different categories assigned to the same product, duplicate records, or incomplete fields. Lack of confidence in the data slows a business down, largely because of the time spent reconciling and verifying data against expectations (Fleckenstein and Fellows, Modern Data Strategy). The business losses stemming from poor data quality can be enormous: in 2013, the 140 businesses surveyed by Halo Business Intelligence estimated an average loss of $8M due to low-quality data. Higher data quality, on the other hand, leads to measurable business improvements:

  • 10-20% reduction in corporate expenses
  • 40-50% reduction in IT costs
  • Up to 40% reduction in operating costs

One example of how poor data quality affects the bottom line is duplicate records. In medicine this is a very real problem, with an average cost of $50 per duplicate record.
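Part of why duplicates are so costly is that exact matching misses most of them: the same entity is rarely spelled identically twice. As a toy illustration (the records and the similarity threshold below are invented for the sketch, not taken from any real system), fuzzy string comparison with Python's standard difflib can flag likely duplicates that exact comparison would miss:

```python
import difflib

# Toy sketch of near-duplicate detection, not a production record-linkage
# system. The records and the 0.9 threshold are invented for illustration.
records = ["John Smith", "Jon Smith", "Jane Doe", "John  Smith"]

for i, a in enumerate(records):
    for b in records[i + 1:]:
        # ratio() returns a similarity score between 0.0 and 1.0
        score = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score > 0.9:
            print(f"possible duplicate: {a!r} ~ {b!r} (similarity {score:.2f})")
```

Real record linkage compares many fields with better similarity measures, but the principle is the same: near matches, not exact ones, are where duplicates hide.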

Another aspect of data quality is ensuring that data is normalized according to internal or external standards. Data coming in from outside sources must be standardized before the extracted information can be put to a uniform use. Unstandardized data can lead to several problems:

  • Incorrect product or service matches, directly affecting the bottom line
  • Errors in reporting that can affect decision making
  • Mistakes in compliance and adherence to business rules
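To make the idea concrete, here is a minimal sketch of the simplest form of standardization, mapping raw values onto a canonical vocabulary. The job-title values below are invented for illustration, not a real standard:

```python
# Minimal sketch of dictionary-based normalization; the vocabulary is a
# made-up example, not a real healthcare staffing standard.
CANONICAL = {
    "rn": "Registered Nurse",
    "r.n.": "Registered Nurse",
    "reg. nurse": "Registered Nurse",
    "registered nurse": "Registered Nurse",
}

def normalize(value: str) -> str:
    """Map a raw value to its canonical form, leaving unknown values as-is."""
    return CANONICAL.get(value.strip().lower(), value.strip())

print(normalize("  R.N. "))       # Registered Nurse
print(normalize("Travel Nurse"))  # Travel Nurse (unknown, kept as-is)
```

A lookup table like this only works when the variation is small and known in advance; for free-form text, you quickly need NLP techniques instead.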

Natural language processing techniques are instrumental in standardizing text data, and this is the data quality aspect with which I helped LaborEdge, a company that makes healthcare staffing software.

LaborEdge bases its system for classifying duty shifts for healthcare professionals on time of day and shift duration. However, the shifts themselves are compiled and entered by staffing agencies, which means tremendous variation in the format of the shift data. Depending on the agency, shift data can include any, all, or none of the following: shift start, shift end, number of shifts per week, number of hours in one shift, and usual time of day for the shift, with times written in either military or standard format. Some examples of input phrases are “0700 – 1730”, “10 Hour Days”, “12hr d”, “7a” and “12H Nights:  7:00 PM – 7:00 AM”.

The classification system is therefore essential to LaborEdge’s job-matching software: it standardizes the ‘shift’ values so that the software can match jobs and candidates. Some of the difficulties LaborEdge encountered with a previous classification system were:

  • Inconsistent and unnecessary ‘shift’ classes caused by manual classification
  • Complex regular expression matching rules: updating one rule could break others, which made debugging difficult
  • The inconvenience of needing a programmer to update the matching rules

To solve this problem, I first took a detailed look at the data to better understand the underlying processes and the existing classes. Then I designed a system combining two NLP approaches, machine learning and regular expression matching, with the matching rules forming a small but robust layer on top of the machine learning stage. Since the number of entities to be parsed out was finite and the entities themselves well defined, the first stage was training an entity extraction model that identified shift start, shift end, shift duration, time of day, and shifts per week in each phrase. Whatever it didn’t catch or didn’t get right was then handled by the regular expression rules.
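To give a flavor of what the fallback stage looks like, here is a heavily simplified sketch of regex-based shift parsing. The patterns and entity names (start, end, hours, time_of_day) are my illustrative inventions, not LaborEdge's production rules, which covered many more variants (am/pm ranges, shifts per week, and so on):

```python
import re

# Simplified sketch of a regex fallback stage; not the production rules.
RANGE = re.compile(r"(\d{1,2}:?\d{2})\s*[-–]\s*(\d{1,2}:?\d{2})")      # "0700 – 1730"
HOURS = re.compile(r"(\d{1,2})\s*(?:hour|hr|h)s?\b", re.IGNORECASE)    # "10 Hour", "12hr"
TIME_OF_DAY = re.compile(r"\b(day|night|eve|d|n)s?\b", re.IGNORECASE)  # "Days", "d"

def parse_shift(phrase: str) -> dict:
    """Extract whatever shift entities the patterns can find in a phrase."""
    entities = {}
    if m := RANGE.search(phrase):
        entities["start"], entities["end"] = m.group(1), m.group(2)
    if m := HOURS.search(phrase):
        entities["hours"] = int(m.group(1))
    if m := TIME_OF_DAY.search(phrase):
        tod = m.group(1).lower()
        # expand single-letter abbreviations to full words
        entities["time_of_day"] = {"d": "day", "n": "night"}.get(tod, tod)
    return entities

print(parse_shift("0700 – 1730"))   # {'start': '0700', 'end': '1730'}
print(parse_shift("10 Hour Days"))  # {'hours': 10, 'time_of_day': 'day'}
print(parse_shift("12hr d"))        # {'hours': 12, 'time_of_day': 'day'}
```

Even this toy version shows why pure rule systems get brittle: every new agency format means another pattern, and patterns can interact in surprising ways, which is exactly why the machine learning stage carries most of the load.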

The result is an automatic system that works without updates or manual intervention and improves the quality of the data by standardizing it. To complete the project, I would use the output of the rule-based stage as a labeled dataset to train a machine learning model that replaces it, so that even less time is spent modifying rules.
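For the sake of illustration, here is how that bootstrapping step could look, reusing the hypothetical parse_shift function from the sketch above. Its output is converted into the (text, annotations) format that NER toolkits such as spaCy expect for training; the label names are placeholders, not a real schema:

```python
# Sketch: turn the rule-based parser's output into labeled NER examples.
# Reuses the illustrative parse_shift defined above; labels are placeholders.

def to_ner_example(phrase: str) -> tuple:
    """Return (text, {"entities": [(start, end, LABEL), ...]})."""
    spans = []
    for label, value in parse_shift(phrase).items():
        offset = phrase.find(str(value))
        if offset != -1:  # normalized values not found verbatim are skipped here
            spans.append((offset, offset + len(str(value)), label.upper()))
    return phrase, {"entities": spans}

train_data = [to_ner_example(p) for p in ["0700 – 1730", "12hr d"]]
print(train_data[0])
# ('0700 – 1730', {'entities': [(0, 4, 'START'), (7, 11, 'END')]})
```

With a few thousand such examples, a statistical entity extraction model can learn the patterns directly from data, so new agency formats are handled by retraining rather than by editing rules.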

Data quality is of utmost importance for business processes: as the figures above suggest, improving it can cut IT costs by 40-50%. Standardizing data coming in from external sources is especially important, as it allows your system to extract the information of interest and improves reporting and compliance. In the LaborEdge case, I used a combination of NLP approaches to create a robust shift description standardization system that eliminates time-consuming manual updates and produces higher quality data.

If you would like help with data quality and standardization of text data, please email me at zhenya@practicallinguistics.com and we can discuss the details.
