Given the scale of Big Data and the IoT (Internet of Things), it is clear that we really have very little privacy. Just consider Justice Sonia Sotomayor’s opinion that “GPS monitoring generates a precise, comprehensive record of a person’s public movements that reflects a wealth of detail about her familial, political, professional, religious, and sexual associations.”  Justice Sotomayor’s opinion came in the 2012 US v. Jones case, in which the US Supreme Court ruled 9-0 that four weeks of GPS data about an alleged drug dealer’s location, obtained from a GPS device attached to his car without a warrant, violated the defendant’s Fourth Amendment guarantee of privacy.

A recent issue of the New York University Journal of Law and Liberty included an article about privacy under the Fourth Amendment entitled “When Enough Is Enough: Location Tracking, Mosaic Theory, and Machine Learning.”  The NYU article was written by Steven M. Bellovin (Professor, Columbia University, Department of Computer Science), Renée M. Hutchins (Associate Professor, University of Maryland Francis King Carey School of Law), Tony Jebara (Associate Professor, Columbia University, Department of Computer Science), and Sebastian Zimmeck (Ph.D. candidate, Columbia University, Department of Computer Science).

The NYU article included a description of “Unsupervised Machine Learning,” which “automatically finds dependencies, correlations, and clusters in the data without requiring any significant human intervention. More specifically, it could perform the following operations” (rough, hypothetical code sketches follow the list):

  • Clustering: In clustering, a system automatically finds groups of users in the dataset that appear statistically similar. For instance, certain individuals may show a pattern of visiting churches on Sundays while others stay home during that time. After application of a clustering algorithm, it becomes relatively easy for a human investigator to observe prototypes from each cluster and figure out which group it represents (for instance, followers of a particular faith, e.g., Christians). The number of groups to be extracted can be fixed (i.e., find the 5 most important groups) or can be automatically estimated. The groupings could be disjoint, overlapping, hierarchical, or nested in various ways. For instance, sub-groups of religious activity (Baptists, Roman-Catholics, Lutherans, etc.) could emerge under a larger umbrella group (Christians).
  • Detection: Given data about individuals as an unbiased sample of the population, a detection system recovers a probability distribution, which says how an individual likely behaves under this sample. This permits an investigator to flag anomalous users in the training data (and in future data) as individuals with a score that is lower than some reasonable threshold. Alternatively, it is possible to identify the handful of users who had the lowest scores as outliers, for example, in a location dataset those who do not exhibit regular location movement. One natural example of an outlier is the mail carrier who spends the workday going door-to-door delivering mail. This is an unusual commute pattern relative to the rest of the population.
  • Visualization and Summarization: Another application of machine learning is visualizing trends in “big data” and highlighting important aspects in it. While each person’s record may contain thousands or millions of bytes of information, a human investigator can only visualize projections of the data in two or three dimensions. Machine learning, however, finds low-dimensional embeddings, which summarize the original data with minimal distortion. For example, the similarities or distances between pairs of visualized low-dimensional embedding-points could be almost equal to the similarities or distances that were measured between pairs of original data points. Alternatively, only the key measurements in the original data points are preserved. For example, from the thousands of stored latitude and longitude coordinates a user visited, it is possible to extract one or two important locations such as the user’s home or place of work.
  • Inference: One of the most powerful unsupervised machine learning techniques is arguably probabilistic inference. In particular, machine learning is able to find dependencies in parts of a collection of data gathered about users. For instance, if we have observed two types of information for many users, say, their location history and web-browsing history, a machine learning system can learn the dependence and correlations between locations and browsing. This allows the system, for example, to fill-in likely browsing patterns for a new user even though only location history for this user was available. Put another way, we can predict a user will probably visit the website espn.com frequently if that user has frequently attended sports events at stadiums.
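The article itself contains no code, but a small sketch can make the clustering and detection ideas above concrete. The following is a hypothetical Python example using scikit-learn on entirely synthetic data: each user is reduced to a handful of location-derived features, a mixture model groups statistically similar users, and a user whose likelihood under the fitted model is far below the population’s typical range is flagged as an outlier (the mail-carrier-style commute pattern). The feature names, group count, and threshold are illustrative assumptions, not anything taken from the article.

```python
# Hypothetical sketch: group statistically similar users and flag anomalous ones.
# All data are synthetic; feature names and the threshold are illustrative choices.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Each row is one synthetic user: [hours at home on Sunday morning,
#   hours at a church-like venue on Sunday, distinct places visited per week]
churchgoers = rng.normal(loc=[1.0, 2.0, 12.0], scale=0.5, size=(100, 3))
homebodies  = rng.normal(loc=[4.0, 0.0, 6.0],  scale=0.5, size=(100, 3))
population  = np.vstack([churchgoers, homebodies])

# Clustering: fit a two-component mixture and read off a group id per user.
model = GaussianMixture(n_components=2, random_state=0).fit(population)
groups = model.predict(population)
print("cluster sizes:", np.bincount(groups))

# Detection: the same fitted model gives a log-likelihood score per user; a score
# far below the population's usual range marks an outlier, e.g. a mail carrier
# who visits hundreds of addresses every working day.
threshold = np.percentile(model.score_samples(population), 1)
mail_carrier = np.array([[0.5, 0.0, 250.0]])
print("mail carrier flagged as outlier:",
      model.score_samples(mail_carrier)[0] < threshold)
```

The number of groups is fixed at two here; as the article notes, it could instead be estimated automatically, for example by comparing mixture models of different sizes with a criterion such as BIC.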

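In the same hedged spirit, here is a hypothetical sketch of the summarization idea: a single user’s long trace of coordinates is boiled down to one or two candidate “important places.” The coordinates are synthetic, and in a real system latitude/longitude pairs would be projected to planar coordinates before clustering.

```python
# Hypothetical sketch: summarize one user's location trace down to two key places.
# Synthetic coordinates only; the "home" and "work" centers are made up.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Thousands of (latitude, longitude) points, mostly near "home" and "work".
home  = rng.normal(loc=[40.730, -73.935], scale=0.002, size=(3000, 2))
work  = rng.normal(loc=[40.758, -73.985], scale=0.002, size=(2000, 2))
trace = np.vstack([home, work])

# Summarization: two cluster centers stand in for the user's important places.
summary = KMeans(n_clusters=2, n_init=10, random_state=0).fit(trace)
print("candidate home/work locations:\n", summary.cluster_centers_)
```

A dimensionality-reduction method such as PCA would play the same summarizing role when each record contains many more than two measurements.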
Of course, the article also includes a discussion of Supervised Machine Learning, which is “more laborious to create since it requires human annotation effort while unsupervised learning is more of a pure data collection exercise. With supervised learning, we can perform the following operations with varying degrees of accuracy” (again, a rough sketch follows the list):

  • Classification: One of the most basic supervised machine learning operations is classification, that is, the identification of a category for a new observation. In addition to collecting data about an individual, classification also requires that we annotate individuals with a discrete label. Collecting such a categorical variable about an individual often requires some effort, expense, or a need for the subject to volunteer information about themselves. For example, in addition to collecting location data, one may survey a small portion of the population and ask them to report their occupation (student, construction worker, taxi driver, etc.). Then, having obtained such labels from the survey, it is possible for a machine learning system to automatically label other individuals using only their location data.
  • Regression: While classification involves obtaining a discrete label for an individual, regression assumes that the label is a continuous scalar value. For instance, instead of a category (such as occupation), we may collect the income that the individual received last year as a numerical value. Machine learning then learns a good prediction function from training examples to accurately estimate the salary of other individuals directly from their location data. For instance, by getting location data from someone who lives in an expensive neighborhood and works in the financial district, it would be possible to estimate a high income level.
  • Prediction: In prediction, the output is either discrete (as in classification) or continuous (as in regression), but it is specifically a quantity that only becomes available in the future, after the raw input data is observed from a user. For example, the output may be the location (latitude and longitude) that the user will visit tomorrow for lunch. Alternatively, it may be the party (Republican or Democrat) that a person will vote for in the next election. By observing a population of users for some time, it may be possible to predict that a user will likely go for pizza at the mall in his or her next lunch break. Prediction may help an advertising company determine what ad to target on a mobile device by delivering a relevant message (for instance, to lure the user to a new pizza establishment in the vicinity of his or her next lunch location).

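Again, as a rough hypothetical illustration rather than the authors’ actual method: the sketch below trains on a small “surveyed” subset of synthetic users who report an occupation label and an income, then labels the remaining users (classification) and estimates their income (regression) directly from location-derived features. Every feature, label, and number is made up for the example.

```python
# Hypothetical sketch: supervised learning from location-derived features.
# Synthetic data; features, occupation labels, and incomes are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(2)

# Feature vector per person: weekly [hours at a campus, hours at construction
# sites, hours driving around the city], derived from location data.
students = rng.normal(loc=[25.0, 0.0, 2.0], scale=2.0, size=(200, 3))
builders = rng.normal(loc=[0.0, 35.0, 5.0], scale=2.0, size=(200, 3))
drivers  = rng.normal(loc=[0.0, 0.0, 40.0], scale=2.0, size=(200, 3))
features = np.vstack([students, builders, drivers])
occupation = np.array([0] * 200 + [1] * 200 + [2] * 200)   # survey labels

# Classification: train on a small "surveyed" subset, then label everyone else.
surveyed = rng.choice(len(features), size=60, replace=False)
clf = LogisticRegression(max_iter=1000).fit(features[surveyed], occupation[surveyed])
predicted_occupation = clf.predict(features)

# Regression: estimate a continuous quantity (income) from the same features.
income = 20_000 + 900 * features[:, 2] + rng.normal(0, 2_000, size=len(features))
reg = LinearRegression().fit(features[surveyed], income[surveyed])
estimated_income = reg.predict(features)

print("occupation label accuracy over all users:",
      (predicted_occupation == occupation).mean())
print("example income estimate:", round(estimated_income[0]))
```

Prediction would follow the same pattern; the only difference is that the target is a quantity that only becomes available later, such as tomorrow’s lunch location or a future vote.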
Even with the FTC’s recent call for Congress to regulate Big Data, it seems likely that we have less and less privacy given the size and scope of the data analysis that Big Data and the IoT make possible.
