How To Clean Up Dirty Data (and Polish PR)

Katie Paine

There is no bigger cause of embarrassment, stress and bad decisions in the communications profession than results that you can’t explain. With big data comes bad data, and most tools leave it up to you, the communicator, to figure out which is which. Here are some tips on how to make sure your data is as immaculate as possible.

Check for dupes. The easiest bad data to spot is duplicate items in your database. Let’s assume you’ve got a database that is in Excel. Click on Sort & Filter and select Custom Sort, then sort by headline, publication and date. Look for instances where you see the same story in the same publication on the same date. Chances are pretty good that’s a dupe.

Now, you need to figure out why it’s in there. Some monitoring companies use multiple searches or “subscriptions” to organize their searches. This methodology frequently leads to duplicate articles, especially if you’re searching for more than one brand. Also some outlets, especially the wires, tend to update a story multiple times.
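If the database is too large to eyeball, the same check can be scripted. The following is a minimal sketch, assuming the spreadsheet has been exported to rows with `headline`, `publication` and `date` fields; those field names and the sample rows are hypothetical, not something any particular monitoring tool produces.

```python
from collections import defaultdict


def find_duplicates(mentions):
    """Group row numbers that share the same headline, publication and date."""
    groups = defaultdict(list)
    for row_number, m in enumerate(mentions, start=1):
        key = (
            " ".join(m["headline"].lower().split()),  # ignore case and extra spaces
            m["publication"].strip().lower(),
            m["date"].strip(),
        )
        groups[key].append(row_number)
    # Keep only the keys that appear on more than one row.
    return {key: rows for key, rows in groups.items() if len(rows) > 1}


# Hypothetical sample data for illustration only.
mentions = [
    {"headline": "Acme Opens New Plant", "publication": "Daily News", "date": "2014-07-01"},
    {"headline": "acme opens  new plant", "publication": "daily news", "date": "2014-07-01"},
    {"headline": "Acme Opens New Plant", "publication": "City Times", "date": "2014-07-01"},
]
dupes = find_duplicates(mentions)
for key, rows in dupes.items():
    print(key, "appears on rows", rows)
```

The sketch only lists likely dupes; it does not delete anything. A wire story that was updated several times may share a headline and date but carry different text, so a person should still decide which copy to keep.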

Check for bad content. Another cause of dirty data is bad search strings. Start by using “Find & Select” in Excel to search for Viagra. Seriously. In one database for a restaurant we searched for “Viagra” and eliminated about 25% of the mentions. Also, look for calendar listings, weddings, police blotter notifications and anything with similar but inappropriate terms.

For example, we did a search for Johnson & Johnson and turned up a remarkably revealing discussion of sexual innuendo and puns that had very little to do with the products the medical supply company provides.

Make sure you’re analyzing all the relevant data and not the junk. Too often, press releases will be included simply because the company’s boilerplate says “...has worked with clients such as…” The fix? Do a search for “PRNEWSWIRE,” “PRWEB” or just “WIRE,” and eliminate any press releases.
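The junk-term and press-release searches described above can be combined into one pass. This is a rough sketch, assuming each mention has a `text` field (a hypothetical name) and that the term list is one you maintain yourself. A bare “wire” is left out of the default list here because it would also match words like “wireless”.

```python
# Hypothetical starter list; extend it with terms that pollute your own results.
JUNK_TERMS = ["viagra", "prnewswire", "prweb"]


def split_junk(mentions, terms=JUNK_TERMS):
    """Separate mentions into (keep, review) based on junk terms in the text."""
    keep, review = [], []
    for m in mentions:
        text = m["text"].lower()
        hits = [t for t in terms if t in text]
        if hits:
            review.append((m, hits))  # record which terms triggered the flag
        else:
            keep.append(m)
    return keep, review


# Hypothetical sample data for illustration only.
mentions = [
    {"text": "Acme's new bistro earns rave reviews downtown."},
    {"text": "Buy cheap Viagra online, dinner at Acme not included."},
    {"text": "NEW YORK, July 1 /PRNewswire/ -- Agency has worked with clients such as Acme."},
]
keep, review = split_junk(mentions)
print(len(keep), "kept;", len(review), "flagged for review")
```

Flagged items go into a review pile rather than being dropped outright, since a simple substring match will occasionally catch a legitimate story.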

Another area to watch: Many of the broadcast clipping services now provide radio, including transcripts, from National Public Radio (NPR) as well as individual shows like Marketplace and Living on Earth. Unfortunately, they haven’t figured out that those statements at the end of the broadcast are paid underwriting credits.

Those underwriting credits should not count as “earned” media. Nor should you include any blog that states on its home page, “I was paid or compensated to write this.” Sort all blogs by frequency and tone. If you get one blogger who is consistently positive and posts regularly, check him or her out. Such bloggers are not all paid, but you’ll want to double-check.
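Sorting blogs by frequency and tone can be approximated with a simple tally per author. The sketch below assumes each post has `author` and `tone` fields, and the two thresholds (at least three posts, at least 90% positive) are arbitrary illustrations rather than any standard.

```python
from collections import Counter


def bloggers_to_check(posts, min_posts=3, min_positive_share=0.9):
    """List frequent, consistently positive bloggers as (author, posts, share)."""
    totals = Counter()
    positives = Counter()
    for p in posts:
        totals[p["author"]] += 1
        if p["tone"] == "positive":
            positives[p["author"]] += 1
    flagged = []
    for author, count in totals.most_common():  # most frequent posters first
        share = positives[author] / count
        if count >= min_posts and share >= min_positive_share:
            flagged.append((author, count, share))
    return flagged


# Hypothetical sample data for illustration only.
posts = (
    [{"author": "fan_blog", "tone": "positive"}] * 4
    + [{"author": "critic", "tone": t} for t in ("positive", "negative", "neutral")]
    + [{"author": "newbie", "tone": "positive"}]
)
flagged = bloggers_to_check(posts)
print(flagged)
```

Appearing on this list is only a prompt to look at the blog’s disclosure statement, not evidence that the blogger was paid.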

The other side of the content equation is to make sure that you are getting all the data you expect. In one client’s case, its coverage suddenly dropped by 50% from the previous quarter.

Turns out that the client had switched vendors and the new vendor was using such narrow search strings that it only included about half of the relevant items. Solution: Eliminate some of the search term restrictions.
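A sudden drop like this is easier to catch if volume is compared quarter over quarter as a routine check. Here is a minimal sketch, assuming you have a list of publication dates; the 30% alert threshold is an arbitrary assumption.

```python
from collections import Counter
from datetime import date


def quarter_of(d):
    """Return (year, quarter) for a date, e.g. (2014, 2) for May 2014."""
    return (d.year, (d.month - 1) // 3 + 1)


def flag_drops(publication_dates, threshold=0.3):
    """Return (quarter, change) pairs where volume fell by `threshold` or more.

    Quarters with no mentions at all do not appear in the counts, so this
    compares each quarter with the previous quarter that had any coverage.
    """
    counts = Counter(quarter_of(d) for d in publication_dates)
    quarters = sorted(counts)
    alerts = []
    for prev, curr in zip(quarters, quarters[1:]):
        change = (counts[curr] - counts[prev]) / counts[prev]
        if change <= -threshold:
            alerts.append((curr, change))
    return alerts


# Hypothetical sample data: four items in Q1 2014, two in Q2 2014.
dates = [date(2014, 1, 5), date(2014, 2, 9), date(2014, 3, 1), date(2014, 3, 20),
         date(2014, 4, 2), date(2014, 6, 11)]
alerts = flag_drops(dates)
print(alerts)
```

An alert does not say why coverage fell. It is a cue to check whether something changed on the collection side, such as a new vendor or narrower search strings, before reporting the drop as real.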

The date does matter. When you’re working on monthly or quarterly reports, it’s important to make sure that the data you are reporting actually appeared in that time frame. Check whether your vendor is reporting “day of publication” (correct) or “day of collection” (irrelevant).
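If your export includes both dates, the mismatch can be surfaced directly. The sketch below assumes hypothetical `published` and `collected` fields; many exports carry only one of the two, in which case you will need to ask the vendor which one it is.

```python
from datetime import date


def split_by_period(mentions, start, end):
    """Return (in_period, misdated) for a reporting period, inclusive.

    in_period: items published within the period.
    misdated:  items collected within the period but published outside it.
    """
    in_period, misdated = [], []
    for m in mentions:
        if start <= m["published"] <= end:
            in_period.append(m)
        elif start <= m["collected"] <= end:
            misdated.append(m)
    return in_period, misdated


# Hypothetical sample data for a Q2 2014 report.
mentions = [
    {"id": 1, "published": date(2014, 4, 15), "collected": date(2014, 4, 16)},
    {"id": 2, "published": date(2014, 3, 28), "collected": date(2014, 4, 2)},
    {"id": 3, "published": date(2014, 6, 30), "collected": date(2014, 7, 1)},
]
in_period, misdated = split_by_period(mentions, date(2014, 4, 1), date(2014, 6, 30))
print([m["id"] for m in in_period], [m["id"] for m in misdated])
```

In this sample, item 2 would inflate the quarter if the report were keyed to collection date, and item 3 would be missed.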

Is that really positive? Look at a representative sample of the data (about 10%) and see if you agree with how it’s been coded. Ideally, you should have two independent coders review to see how often they agree.

If you don’t have the time or the resources to do that, at least review the coding yourself. If you agree with less than 80% of the coding, you will want to go back to the system or the vendor and modify the words the sentiment engine is using to determine positive, negative and neutral.
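Both steps, drawing the sample and measuring agreement, are simple arithmetic. The following sketch assumes tone labels are plain strings and uses simple percent agreement against the 80% rule of thumb above; the function names and sample labels are illustrative.

```python
import random


def draw_sample(items, fraction=0.10, seed=0):
    """Pick a random sample (about 10% by default) for manual review."""
    k = max(1, round(len(items) * fraction))
    return random.Random(seed).sample(items, k)  # fixed seed keeps it repeatable


def percent_agreement(codes_a, codes_b):
    """Share of items on which two sets of tone codes match."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)


# Hypothetical example: the vendor's codes vs. your own review of 10 items.
vendor = ["positive", "positive", "neutral", "negative", "positive",
          "neutral", "positive", "negative", "neutral", "positive"]
yours = ["positive", "neutral", "neutral", "negative", "negative",
         "neutral", "positive", "negative", "positive", "positive"]
agreement = percent_agreement(vendor, yours)
sample = draw_sample(list(range(100)))
print(f"Agreement: {agreement:.0%}")
if agreement < 0.80:
    print("Below 80%: revisit the sentiment settings with the system or vendor.")
```

Here the two sets of codes match on 7 of 10 items, which falls below the 80% mark.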

Many of these bad data problems can be reduced or eliminated by adhering to the measurement standards and the tools that these standards include (like the transparency table and sample code book).

Make sure that any vendors you use comply with the standards. If you’re not using outside vendors, apply the standards to your own work and try to eliminate any confusion.

CONTACT:

Katie Paine is CEO of Paine Publishing. She can be reached at [email protected]. Follow her on Twitter, @queenofmetrics.

This article originally appeared in the July 28, 2014 issue of PR News.