Introduction

The Crowdedness Monitor is a hardware device that can be installed in public places such as libraries to monitor their real-time crowdedness, helping Waterloo students decide on the best place for studying, eating, or exercising. The crowdedness of a location is reflected by the number of unique MAC addresses present, which is obtained by monitoring the packets that Wi-Fi devices transmit at the IEEE 802.11 data link layer.

Background Research

Research shows that every Wi-Fi device, when probing for available networks, unavoidably exposes its MAC address, the unique identifier of a network device, at the data link layer, where it can easily be captured by specialized equipment [1]. In this way, the MAC addresses of Wi-Fi devices within range of the probe can be obtained even when the devices are not connected to any network, making them an ideal measurement of crowdedness.

To address the privacy issue this may cause, many manufacturers have introduced a technique called MAC randomization, which allows devices to probe with a randomized MAC address. However, these randomly generated addresses have the universal/local bit set to 1, so they can be easily distinguished and excluded.
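For illustration, the universal/local bit is the second-least-significant bit of the first octet of a MAC address, so a randomized address can be filtered with a simple bit test (a minimal sketch; the function name and example addresses are ours):

def is_randomized(mac):
    """Return True if the locally administered (universal/local) bit of the
    first octet is set, i.e. the MAC address is randomized."""
    first_octet = int(mac.split(":")[0], 16)
    return bool(first_octet & 0x02)

# Example with made-up addresses:
print(is_randomized("da:a1:19:12:34:56"))  # True:  bit 0x02 of 0xda is set
print(is_randomized("3c:22:fb:12:34:56"))  # False: bit 0x02 of 0x3c is not set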

Implementation

Figure 1. The overall architecture of programs

Figure 1 illustrates how the modules interact with one another. The whole system can be divided into three parts: the backend, the frontend, and the embedded systems.

Hardware / Embedded Systems

The embedded systems consist of Raspberry Pi Zero W devices, which detect nearby Wi-Fi devices and transfer the collected data to the backend server.

The cron daemon invokes the collection script periodically. The script captures packets with tshark, using a Wi-Fi adapter set to monitor mode. The command used is

tshark -l -Ini ${CARD} -o "gui.column.format:\"Source\",\"%us\",\"Destination\",\"%rd\"" -a duration:${SCAN_TIME} > $filename
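For reference, the periodic invocation could be configured with a crontab entry along the following lines (the paths are placeholders; the 5-minute interval matches the scanning interval discussed under Design Trade-offs):

*/5 * * * * /home/pi/crowdedness/collect.sh >> /home/pi/crowdedness/collect.log 2>&1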

A Python script then parses the output of tshark and extracts the MAC addresses. After that, the data is reported to the backend.
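A minimal sketch of such a parser is given below; the file handling, regular expression, and reporting endpoint are illustrative assumptions rather than the exact script used:

import re
import sys

import requests  # used here to report the result to the backend

MAC_RE = re.compile(r"(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}")

def unique_macs(capture_file):
    """Collect the set of unique, non-randomized MAC addresses in a tshark dump."""
    macs = set()
    with open(capture_file) as f:
        for line in f:
            for mac in MAC_RE.findall(line):
                mac = mac.lower()
                if mac == "ff:ff:ff:ff:ff:ff":          # skip broadcast
                    continue
                if int(mac.split(":")[0], 16) & 0x02:   # skip randomized addresses
                    continue
                macs.add(mac)
    return macs

if __name__ == "__main__":
    count = len(unique_macs(sys.argv[1]))
    # Hypothetical endpoint and payload; the real API may differ.
    requests.post("http://backend.example/api/report",
                  json={"station": "DC", "count": count})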

As an experiment, another script controls a MAX7219 LED matrix so that it displays real-time information about the monitored buildings.
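For illustration, the matrix could be driven as sketched below, assuming the luma.led_matrix library; the module layout and displayed text are assumptions:

from luma.core.interface.serial import spi, noop
from luma.core.legacy import text
from luma.core.legacy.font import proportional, CP437_FONT
from luma.core.render import canvas
from luma.led_matrix.device import max7219

# Four cascaded 8x8 modules connected over SPI (layout is an assumption).
serial = spi(port=0, device=0, gpio=noop())
device = max7219(serial, cascaded=4, block_orientation=-90)

def show_crowdedness(label, value):
    """Render a short label and the current crowdedness value, e.g. "DC 42"."""
    with canvas(device) as draw:
        text(draw, (0, 0), "%s %d" % (label, value), fill="white",
             font=proportional(CP437_FONT))

show_crowdedness("DC", 42)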

Backend

The backend server is written with Flask and SQLAlchemy. It responds to API requests from the frontend and embedded systems and stores data into a MySQL database.
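A minimal sketch of this structure is shown below, using the Flask-SQLAlchemy extension; the routes, model fields, and database URI are illustrative assumptions rather than the project's actual schema:

from datetime import datetime

from flask import Flask, jsonify, request
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "mysql+pymysql://user:password@localhost/crowdedness"
db = SQLAlchemy(app)

class Record(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    station = db.Column(db.String(16))
    count = db.Column(db.Integer)
    timestamp = db.Column(db.DateTime, default=datetime.utcnow)

@app.route("/api/report", methods=["POST"])
def report():
    """Called by the Raspberry Pi stations to upload a new measurement."""
    payload = request.get_json()
    db.session.add(Record(station=payload["station"], count=payload["count"]))
    db.session.commit()
    return jsonify({"status": "ok"})

@app.route("/api/crowdedness/<station>")
def crowdedness(station):
    """Returns the latest measurement for a station, consumed by the frontend."""
    latest = (Record.query.filter_by(station=station)
              .order_by(Record.timestamp.desc()).first())
    return jsonify({"station": station, "count": latest.count if latest else 0})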

It also runs machine-learning algorithms (Section 3.4) to predict future trends from historical data.

Frontend

The frontend is a React web app served by Nginx; it fetches data from the backend API and displays it in a user-friendly manner.

Algorithms

Crowdedness: The crowdedness of a location is represented by a number between 0 and 100, calculated by the formula below:

c = 100 × (d − l) / (h − l)

where c is the crowdedness value, d is the current number of unique MAC addresses, h is the 95th percentile of all historical numbers of MAC addresses, and l is the 5th percentile; the percentiles are used to avoid the influence of extreme points and outliers.
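Equivalently, in code (a sketch using NumPy percentiles; clamping the result to [0, 100] is an added assumption for readings outside the historical range):

import numpy as np

def crowdedness(current_count, historical_counts):
    """Map the current unique-MAC count to a crowdedness score between 0 and 100."""
    h = np.percentile(historical_counts, 95)   # robust upper bound
    l = np.percentile(historical_counts, 5)    # robust lower bound
    c = 100 * (current_count - l) / (h - l)
    return float(min(max(c, 0), 100))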

Clustering:

The K-means algorithm from scikit-learn is used to aggregate and summarize hourly crowdedness for each day. Preprocessed historical data, fed into the K-means algorithm, yields a prediction of the crowdedness for every hour of every weekday (Sunday, Monday, etc.).
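A sketch of this step with scikit-learn is shown below; the feature layout (one 24-dimensional hourly profile per observed day) and the number of clusters are assumptions based on the description above:

import numpy as np
from sklearn.cluster import KMeans

# daily_profiles: one row per observed day, 24 columns of hourly crowdedness.
daily_profiles = np.random.rand(60, 24) * 100      # placeholder data
weekdays = np.arange(60) % 7                       # 0 = Monday, ..., 6 = Sunday

kmeans = KMeans(n_clusters=7, n_init=10, random_state=0).fit(daily_profiles)

# Average the cluster centres of the days that fall on each weekday to obtain
# a predicted hourly crowdedness curve for that weekday.
predictions = {}
for day in range(7):
    centres = kmeans.cluster_centers_[kmeans.labels_[weekdays == day]]
    predictions[day] = centres.mean(axis=0)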

ML algorithm (LightGBM):

LightGBM is a powerful tree-based model that can automatically rank the importance of input features. The most difficult part of applying LightGBM is finding the optimal features. The most important feature for this time-series problem is the time feature. Based on the result of the clustering algorithm, it was also discovered that the weekday influences the crowdedness, so it is used as one of the features. Additionally, the hour and minute are used as features, encoded as a sparse one-hot matrix.
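A sketch of the training setup under these feature choices is shown below; the column names, data source, and hyperparameters are assumptions, and for brevity the hour and minute features are one-hot encoded densely rather than as a sparse matrix:

import lightgbm as lgb
import pandas as pd

df = pd.read_csv("history.csv")                    # hypothetical export of the history table
timestamps = pd.to_datetime(df["timestamp"])

# Time-derived features: weekday plus one-hot encoded hour and minute.
df["weekday"] = timestamps.dt.weekday
df["hour"] = timestamps.dt.hour
df["minute"] = timestamps.dt.minute
features = pd.get_dummies(df[["weekday", "hour", "minute"]],
                          columns=["hour", "minute"], dtype=int)

train = lgb.Dataset(features, label=df["count"])   # "count" = unique MAC addresses
model = lgb.train({"objective": "regression", "metric": "rmse"}, train,
                  num_boost_round=200)

# The learned importances can be inspected to confirm which time features matter.
print(dict(zip(features.columns, model.feature_importance())))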

Group members’ contribution

Zuomiao (Zona) Hu

Responsible for the project proposal, implementation of backend features such as time localization and input sanitization, report writing and bug fixes.

Zuoqiu (Robert) Liu

Responsible for designing and implementing the Machine Learning models as well as data analysis.

Song (Tony) Sheng

Assisted in backend development including data acquisition and organization. Also responsible for proofreading the report.

Anhai (Mark) Wang

Responsible for the UI/UX design and implementation of the frontend React web app, as well as designing the presentation slides.

Shiyuan (Harry) Yu

Responsible for the design of the system architecture and the development of embedded systems. Also helped with backend development by implementing the initial framework.

Final Product Evaluation

The final source code can be found at this link.

Most of the planned features have been implemented, including packet analysis, crowdedness calculation, prediction, the backend server, and the web interface. In addition, an LED matrix display was added as an experiment.

Figure 2. The number of unique MAC addresses collected from the Davis Centre library.

Figure 3. Number of unique MAC addresses in the Davis Centre library (11/19 to 11/20)

As shown in the figures, the system reflects the crowdedness of the two libraries. Figure 2 shows that there are fewer people in the DC library during weekends. Figure 3 illustrates that after the library closes, the busyness drops sharply, as expected.

Figure 4. Performance on the test set (prediction for the DP library)

Figure 5. Evaluation of the prediction machine-learning algorithm (DC library)

The root-mean-squared error (RMSE) of the prediction algorithm is between 40 and 45, which is sufficiently accurate for generating the crowdedness score: it corresponds to roughly 10% error in the final crowdedness value. Figures 4 and 5 show that the predictions follow a trend very similar to the actual data.

Design Trade-offs

A trade-off has to be made when choosing the scanning interval of the probe. A short interval leads to more accurate, up-to-date results but consumes more energy; a longer interval, on the contrary, consumes less energy yet produces less real-time results. The finalized version of the device adopts a 5-minute scanning interval, which was experimentally determined to give the best possible energy efficiency while hardly compromising real-time accuracy.

It is also noteworthy that there is likely a strong positive correlation between the number of probing devices deployed and the accuracy of the collected data. Since each detector has a limited range, a considerable number of devices would ideally have to be installed to eliminate blind spots. This, however, is not practical given the cost and labour required to implement such a large-scale probing network. As a result, only two units are installed: one in the Davis Centre library and the other in the Dana Porter library. The units are placed in typical study areas so that the overall crowdedness can be estimated, assuming that people are distributed roughly uniformly.

Future Work

This project can be further improved by adding more probing stations in different locations to achieve better coverage. Installing more devices on each floor would also provide more accurate results.

Another improvement worth trying is integrating a noise detector into the devices, as the ambient noise level is a potentially good indicator of crowdedness.

Furthermore, accuracy might also be enhanced through users' real-time feedback; for example, the website could include a section for user comments.

The prediction algorithm might be further improved by applying Deep Learning.

Finally, developing mobile apps for the Android and iOS platforms would make the service easier to access.

References

[1] J. Martin et al., “A Study of MAC Address Randomization in Mobile Devices and When it Fails,” Proc. Priv. Enhancing Technol., vol. 2017, no. 4, pp. 365–383, 2017.