The Personal World of Veracity
“Big Data” is often associated with a series of V words: typically Volume, Variety, and Velocity are included on everyone’s lists. IBM, among others, makes a point of including Veracity as the fourth V, and in my opinion this is the single biggest consideration in Big Data.
Volume is a function of storage needs and the ability to avoid unnecessary duplication in a data set. Variety can be handled today with an appropriate hierarchy for the sort of data you’re capturing. Velocity is a matter of efficiency, and of how many resources one is willing to throw at the problem using whatever technologies address the first two V’s. Veracity, though, well… that’s just plain personal!
Veracity in “Big Data” is a measure of the quality of the data captured. But what one considers high quality varies greatly depending on usage. Let’s say that I had an app that displays a “Top X” list for specific criteria out of a pre-defined list. The veracity of my data would depend greatly on how well I can capture a metric, how well I can filter out fraudulent metric data, and how responsive I can be to changes in a metric. If I were feeding a targeted email list, I might be most concerned with the accuracy of inferred demographic data. For survey data, I might need to understand how delivery and ambiguity affect the likelihood of realistic responses. Do I need to design some fancy adaptive heuristics to determine if Billy Jean really likes high-speed tugboats, or to identify when I might simply be asking the wrong questions?
How do I typically handle low veracity? To paraphrase Star Wars’ Yoda, “There is no bad data.” Almost any data that could be captured via a service means something within some context. If 93% of requests are fraudulent, then at some point this might be nice to know. If we blocked the worst offenders and our service became twice as fast and twice as cheap, that would not only affect our bottom line but also change how we can price our service to our customers.
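As a rough sketch of that idea, the Python below keeps suspected-fraud requests in the data set and simply measures how much of the traffic and the cost they account for. Everything here is an assumption made for illustration: the Request record, its cost_ms and suspected_fraud fields, and the sample values are hypothetical, and the flag is presumed to come from some upstream check that is not shown.

```python
from dataclasses import dataclass


@dataclass
class Request:
    client_id: str
    cost_ms: float         # hypothetical processing cost of this request
    suspected_fraud: bool  # hypothetical flag set by an upstream check


def fraud_summary(requests):
    """Return the share of requests and the share of cost flagged as fraud."""
    total = len(requests)
    if total == 0:
        return {"fraud_share": 0.0, "fraud_cost_share": 0.0}
    total_cost = sum(r.cost_ms for r in requests)
    flagged = [r for r in requests if r.suspected_fraud]
    fraud_cost = sum(r.cost_ms for r in flagged)
    return {
        "fraud_share": len(flagged) / total,
        "fraud_cost_share": fraud_cost / total_cost if total_cost else 0.0,
    }


# Made-up sample: the flagged requests are kept, not deleted.
sample = [
    Request("a", 10.0, False),
    Request("b", 30.0, True),
    Request("b", 30.0, True),
    Request("c", 10.0, False),
]
summary = fraud_summary(sample)
print(summary)
```

Keeping the flagged rows around is the point: the same records that look like noise for a “Top X” list are the ones that tell you what blocking the worst offenders would save.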
Now one could counter-argue that blocking those folks from the beginning would save a lot of overhead immediately, but with only cursory data points to assess intent you could end up filtering out the wrong folks. If we arbitrarily took the stance that no real customer looks at more than five similar products, with no data to back that up, we could be blocking mostly high-value customers from our analytics. What assessing veracity over time drives us towards is (introducing another V) Value. Some data is just not that economical to store, or at least not in all its intricacies. It’s hard to make that determination without, well… data. As you get to know the veracity of what you’re storing, as it applies to your own use, value naturally falls into place.
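One way to test a rule like the five-similar-products cutoff before enforcing it is to replay it against history and see who it would have blocked. The sketch below is only an illustration under stated assumptions: the session dictionaries, the similar_views and revenue keys, and the sample numbers are all hypothetical, not taken from any real system.

```python
def rule_impact(sessions, max_similar_views=5):
    """Estimate what a 'too many similar products' rule would block.

    Each session is a dict with hypothetical keys:
      similar_views: number of similar products viewed
      revenue: revenue attributed to that session
    """
    blocked = [s for s in sessions if s["similar_views"] > max_similar_views]
    total_revenue = sum(s["revenue"] for s in sessions)
    blocked_revenue = sum(s["revenue"] for s in blocked)
    return {
        "blocked_sessions": len(blocked),
        "blocked_share": len(blocked) / len(sessions) if sessions else 0.0,
        "blocked_revenue_share": (
            blocked_revenue / total_revenue if total_revenue else 0.0
        ),
    }


# Made-up history in which the heavy browsers are also the big spenders.
history = [
    {"similar_views": 2, "revenue": 0.0},
    {"similar_views": 8, "revenue": 120.0},
    {"similar_views": 3, "revenue": 40.0},
    {"similar_views": 12, "revenue": 240.0},
]
impact = rule_impact(history)
print(impact)
```

In this invented sample the rule would block half of the sessions but most of the revenue, which is exactly the kind of result you cannot see without keeping the data long enough to check.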