Enhancing the Value of Citizen Science Data with COMPAIR Quality Measures and Digital Tools

pavel874
Jul 31, 2023
Updated: Aug 1, 2023

In the previous blog we described different types of sensors used in COMPAIR to measure air quality and traffic. Here we would like to delve deeper into the data aspect of our measurement campaigns, focusing on data quality, analysis and sharing.

COMPAIR analytics tools for citizen science data

Sensor data: data interpretation

Compact low-cost air quality sensors used in citizen science projects have simple measurement principles, and as such, their response may suffer from sensitivity to their environment (e.g. temperature, humidity, interfering pollutants), as well as drift during their deployment. Our calibration strategy aims to minimise the influence of sensitivity and drift on data from SODAQ AIR and OnePlanet NitroSense. To achieve this, the sensors will be co-located with high-end reference stations. The sensor data acquired during this period will be used to train a calibration algorithm which will then be applied to the sensors when deployed in the pilot regions by citizens.
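
As a rough illustration of the co-location idea, the sketch below fits a straight line between raw sensor readings and reference readings and then applies it to later field data. It is not the COMPAIR calibration algorithm: the single-predictor linear model, the function names and the numbers are all assumptions made for the example. A real calibration would normally also account for temperature, humidity and drift.

```python
from statistics import mean


def fit_linear_calibration(raw, reference):
    """Least-squares fit of: reference = intercept + slope * raw."""
    raw_mean, ref_mean = mean(raw), mean(reference)
    sxx = sum((x - raw_mean) ** 2 for x in raw)
    sxy = sum((x - raw_mean) * (y - ref_mean) for x, y in zip(raw, reference))
    slope = sxy / sxx
    intercept = ref_mean - slope * raw_mean
    return intercept, slope


def apply_calibration(raw, intercept, slope):
    """Correct raw field readings with the fitted coefficients."""
    return [intercept + slope * x for x in raw]


# Made-up co-location data (assumption): the sensor over-reads the reference.
colocated_raw = [10.0, 20.0, 30.0, 40.0]
colocated_reference = [6.0, 11.0, 16.0, 21.0]

intercept, slope = fit_linear_calibration(colocated_raw, colocated_reference)

# Later, readings collected by citizens in the pilot region are corrected.
field_calibrated = apply_calibration([24.0, 50.0], intercept, slope)
print(intercept, slope, field_calibrated)
```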

Calibration is an important measure to improve quality, but it is not enough on its own. Additional quality assurance steps are needed to compensate for limitations inherent in low-cost citizen science sensors. We will follow these steps during our campaigns and would also like to share them with others. The steps include:

Determining the device’s precision
Determining the device’s accuracy
Identifying outliers i.e. faulty or erroneous data

Determining precision & accuracy of citizen science data

A device is considered precise when multiple devices of the same type yield comparable results. A device is considered accurate when it yields results comparable to a high-quality reference measurement e.g. the official monitoring network in your country.

The ideal way of determining precision and accuracy is to conduct a field test prior to your citizen science experiment. The LIFE VAQUUMS project developed a test protocol for this purpose, which contains both lab and field testing procedures. Lab testing can be skipped for citizen science experiments. The protocol recommends deploying 3 devices at a reference site for a period long enough to capture the variation in concentrations expected during the experiment, and calculating accuracy and precision metrics as described in the document.

This rigorous approach is not always feasible in a citizen science project and might push duration or costs beyond what is tolerable. A trimmed-down accuracy and precision check can therefore be done by looking at how well the data compare between devices (precision) and to the reference (accuracy) at times when concentration levels in the project area are very similar. We recommend using the official monitoring network’s data to look for moments when all sites in a region have comparable levels of pollution. Typically, you can expect these at night (because of the absence of local sources) and on rainy days, but exceptions can occur. You can then do a qualitative assessment of how well the devices match among themselves and with the reference measurements, taking note of any overall under- or overestimation.
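
If you want to put rough numbers next to that qualitative assessment, a minimal sketch could look like the one below. The metrics chosen here (average spread between devices, mean bias and RMSE against the reference) are common choices and are our assumption for the example; they are not necessarily the metrics prescribed in the LIFE VAQUUMS protocol. Device names and values are made up.

```python
from math import sqrt
from statistics import mean, stdev


def between_device_spread(readings_by_device):
    """Precision proxy: standard deviation across devices, averaged over time steps."""
    per_timestep = zip(*readings_by_device.values())
    return mean(stdev(values) for values in per_timestep)


def bias_and_rmse(sensor, reference):
    """Accuracy proxies: mean error (bias) and root-mean-square error."""
    errors = [s - r for s, r in zip(sensor, reference)]
    return mean(errors), sqrt(mean(e * e for e in errors))


# Made-up readings taken at moments when pollution levels were similar everywhere.
readings = {
    "device_a": [11.0, 13.0, 9.0],
    "device_b": [12.0, 14.0, 10.0],
    "device_c": [13.0, 15.0, 11.0],
}
reference = [10.0, 12.0, 8.0]

spread = between_device_spread(readings)
device_mean = [mean(values) for values in zip(*readings.values())]
bias, rmse = bias_and_rmse(device_mean, reference)

# A positive bias means the devices overestimate compared to the reference.
print(spread, bias, rmse)
```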

Identifying outliers

To identify outliers, have a critical look at your data and ask the following questions:

Does a specific device differ strongly from the others?
Do you see very high spikes?
Is there a pattern in pollution peaks occurring?

Investigate these events and assess whether you can consider these measurements to be valid. Whether you consider a measurement valid actually depends on what you are trying to achieve. For instance, when estimating traffic impact on air quality, you can consider high particulate matter concentrations occurring due to road works as invalid data. It is also well known that on foggy days, air quality sensors will overestimate the particulate matter concentration. This can also be a reason to reject data on such days.
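
A simple way to shortlist candidate spikes for this kind of investigation is a robust outlier score. The sketch below uses the modified z-score based on the median and the median absolute deviation, with the commonly used cut-off of 3.5. This is one possible technique, not a method prescribed by COMPAIR, and the sample values are made up. A flag only means "have a closer look"; whether the measurement is valid still depends on your research question.

```python
from statistics import median


def flag_spikes(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values (nearly) identical: nothing can be flagged this way.
        return [False] * len(values)
    return [0.6745 * abs(v - med) / mad > threshold for v in values]


# Made-up hourly particulate matter readings with one very high spike.
pm_values = [12, 14, 13, 15, 250, 14, 13]
flags = flag_spikes(pm_values)

suspects = [v for v, flagged in zip(pm_values, flags) if flagged]
print(suspects)
```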

These data interpretation aspects are the guiding principles when performing citizen science labs. It’s highly recommended to involve domain experts in the preparation and preferably execution of the citizen science labs, to avoid pitfalls in data interpretation by non-experts.

COMPAIR dashboards

The project is developing two dashboards: one focused on citizen science data (air quality and traffic), another on the carbon footprint and ways to reduce it.

The Policy Monitoring Dashboard (PMD) lets users interact with citizen science data coming from different sensors, helping them understand and compare how environmental situations change under different actions. PMD manages and processes air and traffic datasets and enables concise reporting, access, visualisation and analysis of the data. As a result, users are able to generate data reports and to visualise and share the data. PMD is customisable to specific user requirements and offers an easier way to apply data analysis, interpret data from different sources, and disseminate data appropriately, all of which supports more informed decision making.

Figure 1. Policy Monitoring Dashboard

The Carbon Footprint Simulation Dashboard (CO2) is designed to support specific experiments around carbon footprints, or indeed the footprint of any other chosen air pollutant. The visualisation of algorithm results allows users to see and compare how future emissions of CO2 and other pollutants will change based upon different individual actions e.g. washing during day or night, driving or cycling, recycling food, plastic, paper, glass, etc. The aim is to guide user behaviour towards more environmentally friendly choices like limiting waste and maximising recycling, replacing polluting stoves and ovens with less energy-consuming household appliances, opting for more environmentally friendly car use (car sharing), and so on.

Figure 2. Carbon footprint simulation dashboard

Both dashboards are available at https://monitoring.wecompair.eu/. Additional information is provided in D4.3: Digital Twin CS data integration and prototype 2.

Interpreting sensor data

Air Quality

Even once you have assured the quality of your measurements (through calibration, outlier detection, etc.), simply visualising the collected data will not by itself lead to clear conclusions. Interpreting what you see can often be the most difficult step in the process. Whether you are a citizen scientist or guiding citizen scientists through this process, the following are some examples of analyses that can be performed using COMPAIR tools, Excel, a statistical package such as R (and the openair extension) or any combination thereof.

Time series: represent the evolution of your measurements over time. Use this visualisation to explore your dataset and look for phenomena occurring over time
Boxplot: displays the distribution of all measurement values (e.g. for a single device) in relation to the mean, median and certain quartiles. Use this visualisation to compare the overall distribution of pollution (or traffic) levels across devices
Correlation plot: compares the (cor)relation between two datasets. Typically used to look for a (linear) relation between two parameters, e.g. motorised traffic and black carbon
Daily averaged plot: displays the average measurement value for each hour of the day for one or more of your devices. It thus represents the “average day” for that device in the given measurement period. Often divided further into weekday vs. weekend day or even each day of the week. Typically done for the entire measurement period, individual seasons or individual months. Used to look for continuously occurring patterns (e.g. rush hour peak at traffic sites) and differences in patterns between devices/locations, which may indicate the existence of local sources
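
To make the “average day” idea concrete, here is a minimal sketch that groups measurements by hour of the day, split into weekday and weekend. The record layout (a timestamp plus a value) and the numbers are assumptions for the example; the resulting table is what you would then plot.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def average_day(records):
    """Return {(day_type, hour): mean value} for (timestamp, value) records."""
    buckets = defaultdict(list)
    for timestamp, value in records:
        # weekday() is 0 for Monday ... 6 for Sunday
        day_type = "weekend" if timestamp.weekday() >= 5 else "weekday"
        buckets[(day_type, timestamp.hour)].append(value)
    return {key: mean(values) for key, values in buckets.items()}


# Made-up 8 o'clock readings from one device.
records = [
    (datetime(2023, 7, 3, 8), 30.0),  # Monday
    (datetime(2023, 7, 4, 8), 34.0),  # Tuesday
    (datetime(2023, 7, 8, 8), 12.0),  # Saturday
]

profile = average_day(records)
print(profile[("weekday", 8)], profile[("weekend", 8)])
```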

Many more visualisations and analyses exist. COMPAIR dashboards present specific visualisations tailored to the impact assessment of local measures. A good starting point to explore additional analyses is the aforementioned openair package for R.

Traffic

Traffic data in COMPAIR comes from Telraam sensors. Telraam was used in citizen science settings before this project, so scripts are already available for citizen science labs that work with the traffic counting data Telraam produces. One of those is the “data workshop” format, held together with sensor owners as well as residents of the area who know the local (traffic) situation. This data workshop follows a fixed script with four main blocks:

Recap of the data collection campaign i.e. how many devices deployed, amount of data collected, location of sensors
An introduction to generic data-analysis techniques for traffic counting data
Examples of analysis for the sensors owned by the participants
Interactive component with additional analysis in the workshop, based on discussions with the participants

To highlight some aspects, we explain the data-analysis techniques for traffic counting data used in those data workshops, including what can be learned from each approach and what its drawbacks are.

Technique 1: Time-series analysis

A raw data time series plot gives an “at a glance” view of how traffic has evolved over time. Traffic is typically quite stable, following logical patterns with reduced volumes during weekends and holiday periods. As such, time series allow us to quickly spot deviations from these patterns that can trigger a deeper investigation. In the example below, it is evident that traffic increased at the end of August, which could hint at an event or change.

Figure 3. Time-series analysis

The risk with time series is that (faulty) outliers can obscure the overall picture. Caution is always needed when looking at raw data.
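
One common way to keep a single faulty value from dominating the picture is to look at a rolling median next to the raw series. The sketch below is an illustration of that idea only; the window size and the daily totals are made up, and this is not a feature of the Telraam platform.

```python
from statistics import median


def rolling_median(values, window=7):
    """Centred rolling median; the window shrinks at the edges of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        start, end = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(median(values[start:end]))
    return smoothed


# Made-up daily traffic totals with one faulty value (5000).
daily_totals = [500, 510, 495, 5000, 505, 498, 502]
smoothed = rolling_median(daily_totals, window=3)

# The smoothed series stays close to the usual level despite the outlier.
print(smoothed)
```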

Technique 2: Typical profiles

Typical traffic profiles can be derived from a sufficiently long time series. For traffic, typical profiles make sense at a daily level for each day of the week and at an hourly level for weekdays versus weekends. The former shows which days of the week are the busiest. Usually, weekdays have similar intensity, Saturdays are quieter and Sundays have the lowest traffic volumes. Deviating patterns are nonetheless common, for example roads with increased traffic on Saturdays due to nearby shopping areas.

For the intra-day traffic pattern, the most common shape has a short, sharp morning peak and a less pronounced but longer evening peak. Other intra-day patterns are possible too, such as a peak around lunchtime or an earlier peak of bikes, which signals the presence of a school nearby.

Figure 4. Typical profiles

The risk with typical traffic profiles is that you have to select a representative time period for the profile you want to understand. A holiday period, for example, will give a different profile. Also, any traffic-related events during the selected period are averaged out and become invisible. For this reason, it’s important to combine different data analysis techniques.
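
For readers who want to reproduce a day-of-week profile themselves, a minimal sketch follows. It first sums hourly counts into daily totals and then averages those totals per day of the week. The input layout (timestamp plus count) and the numbers are assumptions for the example, not the Telraam export format.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def weekday_profile(hourly_counts):
    """Average daily traffic total per day of the week."""
    daily_totals = defaultdict(int)
    for timestamp, count in hourly_counts:
        daily_totals[timestamp.date()] += count

    by_weekday = defaultdict(list)
    for day, total in daily_totals.items():
        by_weekday[WEEKDAYS[day.weekday()]].append(total)
    return {name: mean(totals) for name, totals in by_weekday.items()}


# Made-up hourly counts: two Mondays and one Sunday.
hourly_counts = [
    (datetime(2023, 7, 3, 8), 100), (datetime(2023, 7, 3, 17), 120),
    (datetime(2023, 7, 9, 8), 20), (datetime(2023, 7, 9, 17), 40),
    (datetime(2023, 7, 10, 8), 90), (datetime(2023, 7, 10, 17), 110),
]

typical_week = weekday_profile(hourly_counts)
print(typical_week)
```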

The citizen science lab, in a workshop format, allows for interaction between researchers and citizen scientists, helping the latter to understand observed patterns, especially when they seem counterintuitive. Citizen scientists’ knowledge of the local traffic situation can be extremely valuable in helping researchers to interpret the data. Such workshops represent a “win-win situation” for researchers and for interested and engaged citizens. They also portray the essence of citizen science, where everyone benefits and interested participants are able to join not only during data collection but also at other stages e.g. problem formulation, interpretation of findings.

Technique 3: Comparing periods

The third generalised analysis technique for Telraam traffic data builds on the previous one: comparing typical profiles of two distinct periods. This is particularly useful if an intervention or change has happened which is expected to influence traffic patterns. In such a case, you typically select a period of a few months before the intervention (baseline) and a few months after the intervention. In the figures below, you’ll see daily typical traffic for two periods (left) and typical speed profiles for the period before traffic calming measures were introduced (right). Such a comparison allows the interpreter to understand whether the measures have had any effect.

Figure 5. Comparing periods

The same risks apply to this technique, i.e. the selected time intervals should be appropriate for deriving a typical traffic profile, and any changes within a selected period are masked.
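
A before/after comparison of this kind can be sketched as follows: split the counts at the intervention date, build an hourly typical profile for each period and express the difference as a percentage. The intervention date, data layout and counts are all made up for the example.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def compare_periods(records, intervention):
    """Percentage change in mean hourly count, per hour of day, after vs before."""
    before, after = defaultdict(list), defaultdict(list)
    for timestamp, count in records:
        target = before if timestamp < intervention else after
        target[timestamp.hour].append(count)

    change = {}
    # Only hours that occur in both periods can be compared.
    for hour in sorted(before.keys() & after.keys()):
        baseline = mean(before[hour])
        change[hour] = (
            100.0 * (mean(after[hour]) - baseline) / baseline if baseline else None
        )
    return change


# Hypothetical intervention date and morning rush hour counts.
intervention = datetime(2023, 9, 1)
records = [
    (datetime(2023, 6, 5, 8), 200), (datetime(2023, 6, 6, 8), 220),
    (datetime(2023, 10, 2, 8), 160), (datetime(2023, 10, 3, 8), 176),
]

change_per_hour = compare_periods(records, intervention)
print(change_per_hour)
```

A negative value means less traffic in that hour after the intervention than in the baseline period.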

The Telraam platform can generate these graphs in MS Excel format on the spot during workshops, in just a few minutes. The graphs facilitate interaction between sensor owners and local residents, and make it possible to investigate specific street segments for specific periods, so that changes in traffic patterns suggested by participants’ local knowledge can be verified with data.

APIs

API stands for “Application Programming Interface”. In this context, it is used to access data directly from the source (i.e. the cloud database), in order to extract and process data for interpretation and, in some cases, to design other indicators and new data dashboards. Interacting with an API requires programming skills, so citizen science labs working with APIs will have a niche audience of programmers. In this section, we briefly describe the different API options in COMPAIR.

COMPAIR data manager

To help groups of citizens get as much value as possible out of the sensors and the citizen science project as a whole, the data used in the different tools and dashboards will be made available through an open API. The data can therefore be used in dashboards and/or tools built by citizen scientists. Using the OGC SensorThings API between the different components of the COMPAIR toolkit facilitates this: it is the de facto standard for all things sensor, it is well documented and there is great tooling available, e.g. Fraunhofer’s FROST server.
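
To give an idea of what a SensorThings request looks like, the sketch below builds a URL that asks for the observations of one datastream within a time window, using the standard query options $filter, $orderby and $top. The base URL and the datastream id are placeholders we made up; the actual COMPAIR endpoint and ids are not given here. The code only builds the URL and does not send a request.

```python
from urllib.parse import quote, urlencode

# Hypothetical base URL (assumption); replace with the real service address.
BASE_URL = "https://example.org/FROST-Server/v1.1"


def observations_url(datastream_id, start, end, top=100):
    """Build a SensorThings URL for observations in the window [start, end)."""
    query = {
        "$filter": f"phenomenonTime ge {start} and phenomenonTime lt {end}",
        "$orderby": "phenomenonTime asc",
        "$top": str(top),
    }
    # Keep "$" readable in the option names; spaces are encoded as %20.
    encoded = urlencode(query, quote_via=quote, safe="$")
    return f"{BASE_URL}/Datastreams({datastream_id})/Observations?{encoded}"


url = observations_url(42, "2023-07-03T00:00:00Z", "2023-07-04T00:00:00Z")
print(url)
```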

Telraam API

While Telraam data is accessible via COMPAIR’s data manager, there’s also an open API that goes directly to the Telraam data. This API has more “end points” (i.e. data interactions with the database), giving you more options to design your own data products or analyses.

Telraam’s API is well documented using Postman’s documenter.

The API documentation includes definitions of the terms used, so no prior knowledge of Telraam or of traffic data in general is required. It also includes Python code snippets which can be used even by people with no coding experience. The Telraam API is mature and has been used extensively in the past. As such, it is streamlined and fine-tuned for immediate use. The Telraam Talks platform includes forum discussions where API users can exchange experiences.
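
For a first impression of what such a snippet might involve, the sketch below prepares a request for hourly traffic counts on one street segment. The endpoint URL, the X-Api-Key header and the field names in the request body are our recollection of the Telraam API and should be treated as assumptions to check against the official documentation; the segment id and API key are placeholders. The request is only built here, not sent.

```python
import json
from urllib.request import Request

# Assumed endpoint (verify against the Telraam API documentation).
TELRAAM_TRAFFIC_URL = "https://telraam-api.net/v1/reports/traffic"


def build_traffic_request(api_key, segment_id, time_start, time_end):
    """Prepare a POST request for per-hour counts of one segment."""
    payload = {
        "level": "segments",      # assumed field names and values
        "format": "per-hour",
        "id": str(segment_id),
        "time_start": time_start,
        "time_end": time_end,
    }
    return Request(
        TELRAAM_TRAFFIC_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Api-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder key and segment id, for illustration only.
request = build_traffic_request(
    "YOUR-API-KEY", "9000000000", "2023-07-03 00:00:00Z", "2023-07-04 00:00:00Z"
)
print(request.get_method(), request.full_url)
```

Sending the prepared request with urllib.request.urlopen(request) and parsing the JSON response would be the next step once you have your own API key.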

COMPAIR citizen science labs using the Telraam API could be a hackathon for a niche audience, providing challenges to produce specific outputs. However, given the well documented API and easy to understand data, the Telraam API can also provide a generic use case, training citizen scientists and/or students on how to work with datasets, potentially as an introduction to coding (e.g., Python), directly with real world data.