A project by
Como Smart City - Data Shack Program 2017
The City of Como project is a collaboration with Fluxedo, an Italian startup working in partnership with the municipality of Como to model the dynamic flow of people in the city. The overall aim of the project is to integrate multiple and diverse data sources to build a picture of the way people live and move around the city. Using historical telecom and social media data along with other geolocated data, we have created an intuitive visualisation of several models of the daily movements of different demographic groups throughout Como, varying by day and time. We have also explored the motivations for movement throughout the city based on social media activity and content.
As a large proportion of the city's revenue comes from tourism, the city is interested in better understanding how to cater for tourists and use resources more effectively. While the municipality has conducted analyses of telecom data in the past, we have been able to provide a more intuitive visualisation of data that was previously too fine-grained to be understood effectively in both its spatial and temporal dimensions.
To facilitate the use of telecom data while addressing privacy concerns, the data was aggregated over 10 regions of the city and at an hourly level. The data covers the period between June 1st and October 31st, which we believe captures three phases of activity in Como.
The data is available to us at several granularities: at both daily and hourly levels, and at region and full-city levels. The data is also subsetted by gender and age range, among other factors we did not consider (largely because the privacy protection resulted in a large amount of missing data). While inspecting the data, we realized that a significant amount of overcounting was occurring. That is, aggregating the hourly, region-level data produced significantly more people than the total number of people in the city for a given day. This is because, within any given hour, an individual who connects to cell phone towers in two different regions is counted in both. This presents an interesting modeling problem: information about possible movement patterns is embedded in the totals, but it is difficult to extract the number of people in any given region at any given time. We chose two models to deal with this problem: a linear-program-based model which assigns people to specific, physically possible routes through the city (regions passed through within any given hour), and a spatial Gaussian process smoothing model, which should give us a general sense of the popularity of a given area abstracted from the absolute counts.
Linear programming is a method to calculate the optimal allocation between elements in a linear objective function subject to linear equality and inequality constraints. That is, linear programming methods solve a problem of the form:
maximize c^T x subject to Ax = b and Gx ≤ h
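As a concrete illustration, a problem in this standard form can be solved with SciPy's linprog (which minimizes, so the objective is negated to maximize); the numbers below are arbitrary, not drawn from the Como data:

```python
# Minimal sketch of the standard form above, with made-up numbers:
# maximize c^T x subject to A x = b and G x <= h, x >= 0.
import numpy as np
from scipy.optimize import linprog

c = np.array([3.0, 2.0])        # objective coefficients
A_eq = np.array([[1.0, 1.0]])   # equality constraint: x0 + x1 = 4
b_eq = np.array([4.0])
A_ub = np.array([[1.0, 0.0]])   # inequality constraint: x0 <= 3
b_ub = np.array([3.0])

# linprog minimizes, so pass -c to maximize c^T x
res = linprog(-c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(res.x)  # optimal allocation: x0 = 3, x1 = 1
```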
The key insight behind the application of the linear programming model is that the double counts and the total city count
are related through the paths that people followed over any given hour. In particular, the total number
of people overcounted is exactly equal to the number of regions passed through by individuals beyond the first region they
appeared in. Therefore, we can enumerate possible paths through the city and attempt to map people onto them by setting up a
linear programming problem that weights shorter paths as more likely and is subject to the constraints that the total number
of people in any region or hour must sum to the ground truth provided by the whole city data.
By considering the structure of the map regions, we enumerate all possible directed paths. That is, because region 6 is
connected to regions 8, 5, 2, and 7, {6,8}, {6,5}, {6,2}, and {6,7} are legal paths whereas {6,3} is not.
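The enumeration step can be sketched as follows; the adjacency map below encodes only the neighbours of region 6 quoted above, with the remaining entries invented for the example:

```python
# Illustrative sketch: enumerate simple directed paths of up to 3 regions
# over a region adjacency map. Only region 6's neighbours come from the
# text; the other entries are made up for demonstration.
def enumerate_paths(adjacency, max_len=3):
    paths = [[r] for r in adjacency]          # length-1 paths: stayed in one region
    frontier = list(paths)
    for _ in range(max_len - 1):
        next_frontier = []
        for path in frontier:
            for nbr in adjacency[path[-1]]:
                if nbr not in path:           # simple paths only, no revisits
                    next_frontier.append(path + [nbr])
        paths.extend(next_frontier)
        frontier = next_frontier
    return paths

adjacency = {6: [8, 5, 2, 7], 8: [6], 5: [6], 2: [6], 7: [6]}
paths = enumerate_paths(adjacency)
print([p for p in paths if p[0] == 6 and len(p) == 2])
# legal length-2 paths out of region 6: [6, 8], [6, 5], [6, 2], [6, 7]
```

Because region 3 is absent from region 6's neighbour list, no path {6,3} is ever generated.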
If we assume that in general people are more likely to have been on shorter paths, this leads to the construction of our
objective function for any given transition hour as:

maximize a1·Σx1 + a2·Σx2 + a3·Σx3

where x1 is the vector of dummy variables for all paths of length 1, x2 is the vector of dummy variables for all paths of length 2, x3 is the vector of dummy variables for all paths of length 3, and a1 > a2 > a3.
The simplest form of the problem is to enumerate dummy variables for all possible paths, with the solved value of each variable being the number of people who were on that path in that hour (i.e. passed through those regions of the city). This is subject to the constraints that, for every region, the total number of people who passed through that region must equal the number of people logged in that region during that hour, and that the total number of people in the city (the sum of all of the path variables) must equal the total number of unique people logged in the city. This results in a general formulation of the constraint matrices A and G as large, sparse 0-1 matrices. We can see that this formulation allows for the double counting: a path {6,5,10} contributes one person each to the counts for
the regional constraints but only a single person to the total unique persons constraint.
Now, note that this basic form of the problem is not in general well-specified. For a simple example, if we know that there are
2 total people in the city and we have the region counts 1:1, 2:1, and 3:1, we cannot distinguish between the solutions {1,2}:1, {3}:1 and {1,3}:1, {2}:1. In fact, when we run the problem with only these constraints, there are a huge number of feasible solutions and we do not attain a reasonable dispersion of people throughout the city.
However, the structure of the datasets we were provided naturally leads to a large number of possible additional
constraints in a linear programming environment. In particular, any given subset (gender, age range, country/province) provided
should sum over all possible paths including that variable to the total number reported in that dataset.
We computed these linear programs on a pairwise hourly level subject to the additional temporal constraint that the number
of people moving to a region in any given hour (that is, on a path that ends in that region) must be lower than the total
number of people who are in that region in the next hour. People who leave the city are assumed to have left from a single
region. Additionally, we limit people to have been in at most three regions in any one hour. These simplifications lead to a
linear program that is quite well-specified and leads to a reasonable dispersion of people over paths in any given hour.
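To make the constraint structure concrete, here is a toy version of the hourly assignment LP for a hypothetical two-region city (all counts invented): region counts of 3 and 2 against 4 unique people force exactly one person onto a two-region path, with shorter paths weighted higher in the objective.

```python
# Toy sketch of the hourly assignment LP, with made-up numbers.
# Variable order: x = [x_{1}, x_{2}, x_{12}, x_{21}] (people per path).
import numpy as np
from scipy.optimize import linprog

a1, a2 = 2.0, 1.0                      # shorter paths weighted higher (a1 > a2)
c = np.array([a1, a1, a2, a2])

A_eq = np.array([
    [1, 0, 1, 1],                      # people logged in region 1
    [0, 1, 1, 1],                      # people logged in region 2
    [1, 1, 1, 1],                      # unique people in the whole city
])
b_eq = np.array([3, 2, 4])             # region counts sum to 5, but only 4 unique people

res = linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
x1, x2, x12, x21 = res.x
print(round(x1), round(x2), round(x12 + x21))  # 2 1 1
```

As the text notes, the split between the two directed paths {1,2} and {2,1} is not identifiable from these constraints alone; only their sum (the overcount) is pinned down, which is why the demographic-subset and temporal constraints matter.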
Another option for interpreting the noisy, overcounted data is to focus on the relative trend of the population count in any given region, both compared with that region at other points in the timeframe and compared with the surrounding regions.
For this we use kriging, or Gaussian process regression, a spatial interpolation method in which data samples are interpreted
as being noisy estimates from an underlying random process with prior covariances.
In particular, we know that the data samples we have are stochastic estimates of the true number of people in a region at
any given time, and we expect that because people must flow through the city smoothly (that is, they cannot teleport from one
region to another), the covariance between the population count at different regions should increase as the distance between
those two regions decreases.
We can check that this assumption is true by plotting the covariogram of the data points:
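A sketch of that covariogram check on simulated counts (the region coordinates and the squared-exponential covariance here are assumptions for illustration, not the Como data):

```python
# Empirical covariogram sketch: simulate counts with spatially correlated
# noise, then verify that covariance between nearby "regions" exceeds
# covariance between distant ones.
import numpy as np

rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(10, 2))          # 10 simulated region centroids
dists = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)

# simulate hourly counts under a squared-exponential covariance prior
cov_true = np.exp(-(dists / 3.0) ** 2)
counts = rng.multivariate_normal(np.zeros(10), cov_true, size=2000)

emp_cov = np.cov(counts.T)                         # empirical region-pair covariances
near = dists < 3.0
far = ~near
np.fill_diagonal(near, False)                      # ignore self-covariance
print(emp_cov[near].mean() > emp_cov[far].mean())  # True: covariance decays with distance
```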
Using Mobile Phone Data to Analyze Movement of People
Dan, Y., & He, Z. (2010). A dynamic model for urban population density estimation using mobile phone location data. 2010 5th IEEE Conference on Industrial Electronics and Applications. doi:10.1109/iciea.2010.5514844
Song, C., Qu, Z., Blumm, N., & Barabási, A.-L. (2010). Limits of predictability in human mobility. Science, 327(5968), 1018-1021.
Linear Programming / Markov Chains
Bernstein, G., Sheldon, D. (2016). Consistently Estimating Markov Chains with Noisy Aggregate Data. arXiv:1604.04182
Luedtke, J., Ahmed, S., & Nemhauser, G. (n.d.). An Integer Programming Approach for Linear Programs with Probabilistic Constraints. Integer Programming and Combinatorial Optimization, Lecture Notes in Computer Science, 410-423. doi:10.1007/978-3-540-72792-7_31
Gaussian Processes
Flaxman, S. (2014). A General Approach to Prediction and Forecasting Crime Rates with Gaussian Processes
Bohling, G. (2005). Kriging.
Hartikainen, J., Jylänki, P., Riihimäki, J., Tolvanen, V., Vanhatalo, J., & Vehtari, A. (2013). GPstuff: Bayesian modeling with Gaussian processes. Journal of Machine Learning Research, 14, 1175-1179.
Visualisations
URBAN LENS. (n.d.). Retrieved March 31, 2017, from http://senseable.mit.edu/urban-lens/
Yau, N. (2017, March 10). A Day in the Life of Americans. Retrieved March 31, 2017, from http://flowingdata.com/2015/12/15/a-day-in-the-life-of-americans/
Living Cities - A shared cartographic experiment from HERE and CartoDB. (n.d.). Retrieved March 31, 2017, from http://livingcities.here.com/#cities/london/51.505/-0.188/11/
Big Data Visualization & Society - Riyadh. (n.d.). Retrieved March 31, 2017, from http://dataviz-riyadh.mit.edu/projects.html
Stitch Fix Algorithms Tour. (n.d.). Retrieved May 9, 2017, from http://algorithms-tour.stitchfix.com/
Screenshots from the model and the insights we gained
As well as understanding how people move through the city, we explored some of the reasons why. Initially we explored using geolocated social media posts, including Twitter, Instagram and TripAdvisor, as indicators of activity. Although the posts are too sparse to be a useful signal of activity level, they can be used to infer the types of activities associated with each region of the city. We started by developing a representation of two time periods (Tuesday afternoon and Saturday evening) in the TripAdvisor and Instagram data. We analyzed all of the check-ins on the TripAdvisor app bucketed by region. The hashtags and comments of Instagram posts bucketed by region were analysed using tf-idf, a method of scoring the importance of words based on their occurrence in each post relative to an entire corpus. This has the effect of weighting less frequently occurring words more heavily. Stop words were also removed. From this analysis we extracted a representative set of hashtags for each region and separately chose an Instagram picture which represented the activity in each region. These analyses resulted in a general view of the city from the social media point of view: the TripAdvisor data pointed out the general interests of tourists, showing how they move from the main monuments and attractions during the day to the restaurants in the evening. In a similar way, the Instagram data gave an insight into the most significant places in each of the telecom regions, highlighting the main attractions of the city for young people during the day and at night. It is also clear that the pictures and activities change dramatically between seasons.
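The tf-idf step can be sketched with a minimal implementation on made-up Instagram-style captions bucketed by region (not the real data):

```python
# Minimal tf-idf sketch on invented captions, one token list per region.
# Terms occurring in every region get an idf of zero, so region-specific
# words dominate each region's representative set.
import math
from collections import Counter

region_docs = {
    "duomo": ["cathedral", "duomo", "como", "architecture"],
    "lakefront": ["lake", "como", "sunset", "boats", "lake"],
    "funicular": ["brunate", "funicular", "view", "como"],
}

n_docs = len(region_docs)
df = Counter()                          # document frequency: regions containing each term
for tokens in region_docs.values():
    df.update(set(tokens))

def tfidf(tokens):
    tf = Counter(tokens)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf}

# 'como' occurs in every region, so its weight is zero, while a
# region-specific term like 'lake' dominates its region
weights = tfidf(region_docs["lakefront"])
top = max(weights, key=weights.get)
print(top, round(weights["como"], 3))  # lake 0.0
```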
In our initial meetings with the city, they expressed an interest in understanding one particular movement pattern: Como seems to empty out after 8pm, whereas historically this has not been the case. Exploration of the telecom data reinforced this view; however, the only reliable data source we had for inferring motivations for movement was the TripAdvisor data, which predominantly reflects tourist activity.