Estimating cycling rates using Strava

Detailed and reliable data about the numbers of people walking and cycling have a range of benefits. They provide evidence to inform transport planning decisions, support the evaluation of interventions designed to increase active travel, and supply the baseline frequency data needed to calculate collision rates (e.g. n collisions per 1,000 cyclists or pedestrians). A further use of data about pedestrians and cyclists is in assessing which environmental and infrastructure features influence the decision to walk or cycle. We are particularly interested in whether lighting can encourage active travel after dark, as we have previously shown that darkness significantly reduces the number of people walking and cycling.

Existing sources of data about numbers of active travellers include behavioural questionnaires, travel diaries, manual counts at a sample of locations and on a sample of days, and automated counters at specific locations. All these methods have limitations, however: they may only provide data about a sample of people, a sample of locations, or a sample of days. To assess the impact of lighting on cycling and walking rates, we need data on a street-by-street basis (so lighting on different streets can be compared) and at a fine temporal scale, such as by the hour (so periods of darkness and daylight can be compared). GPS-recorded crowdsourced data about walking and cycling trips offer great potential as they can meet both of these requirements for spatial and temporal detail. The most popular trip-recording application is Strava, which allows users to record the time and position details of their trip through any GPS-enabled device, including a smartphone.

Although the number of people using Strava is large and continuing to grow, its users still represent only a small sample of all pedestrians and cyclists. There is also a commonly held belief that Strava is generally used by a niche group of sporty or competitive active travellers, and may therefore be a biased sample. To check this, we compared the age and gender profile of Strava cyclists in the Tyne and Wear area, using data provided by the Urban Big Data Centre, with the age and gender profile of 2011 Census respondents in the same area who travelled to work mainly by bicycle (see Table 1). There was a higher proportion of men amongst the Strava cyclists than the Census data suggested. There was also a larger proportion of Strava cyclists in the 35-44 age bracket. However, the differences between the Strava sample of cyclists and the cycle commuters reported in the 2011 Census are not as large as we initially expected.

Table 1. Percentage of Strava cyclists in Tyne and Wear, 2015, by age group and gender, compared with Census 2011 respondents who commute by bicycle.

To further check how reliable Strava data may be as a proxy measure for numbers of cyclists, we compared counts of Strava cyclists at specific locations in Glasgow against ‘ground truth’ data – manual counts of all cyclists at those same locations as recorded in the Glasgow City Centre cordon survey.

This survey is carried out over 2 days every September, covering 35 locations that form a perimeter around the city centre. Counts of cyclists are recorded in 30-minute bins. These counts were compared against the number of Strava cyclists passing through the same segment of road as the manual count location, for the years 2013 and 2014.
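To line Strava records up with the manual counts, each Strava pass first has to be assigned to the same 30-minute bins used by the survey. The snippet below is a minimal sketch of that step, not the processing actually used in this work. The function names and the pass times are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def half_hour_bin(ts: datetime) -> datetime:
    """Round a timestamp down to the start of its 30-minute bin."""
    start_minute = 0 if ts.minute < 30 else 30
    return ts.replace(minute=start_minute, second=0, microsecond=0)


def count_by_bin(timestamps):
    """Count how many timestamps fall in each 30-minute bin."""
    return Counter(half_hour_bin(ts) for ts in timestamps)


# Made-up pass times at one hypothetical count location (not real data).
strava_passes = [
    datetime(2014, 9, 2, 8, 5),
    datetime(2014, 9, 2, 8, 29, 59),
    datetime(2014, 9, 2, 8, 30),
    datetime(2014, 9, 2, 17, 45),
]

strava_counts = count_by_bin(strava_passes)
for bin_start, n in sorted(strava_counts.items()):
    print(bin_start.strftime("%Y-%m-%d %H:%M"), n)
```

The binned Strava counts can then be paired with the manual count for the same location and time bin.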

We used linear regression models to assess how well the Strava data predicted the ‘ground truth’ data. The Strava data performed well when it was aggregated across a whole year and used to predict the total ground truth counts over the 2-day period of the survey, producing an R-squared value of 0.82. Strava data performed less well when it was used to predict ground truth counts by the hour, rather than aggregated across a whole day – the R-squared value reduced to 0.75. When Strava data was further limited to just the two days on which the manual survey was carried out, its predictive power reduced further, with an R-squared value of 0.59. See Figure 1.

Figure 1. Left: Annual Strava counts for 2013 and 2014 compared with mean daily count over 2 days of manual survey for 2013 and 2014. N = 70. Right: Hourly Strava counts compared with counts from manual survey on same days, 2013 and 2014. N = 1960. Blue lines indicate linear regression fit.
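For readers unfamiliar with the R-squared values quoted above, the sketch below shows how a simple linear regression of manual counts on Strava counts can be fitted and scored. It is an illustration under stated assumptions rather than the analysis code behind Figure 1: the helper names are hypothetical and the five data points are invented, so the result it prints has no bearing on the values reported here.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Invented counts for five hypothetical locations (not real data).
strava_counts = [1, 2, 3, 4, 5]
manual_counts = [12, 19, 31, 42, 48]

slope, intercept = fit_line(strava_counts, manual_counts)
r2 = r_squared(strava_counts, manual_counts, slope, intercept)
print(f"slope={slope:.2f} intercept={intercept:.2f} R-squared={r2:.2f}")
```

An R-squared of 1 would mean the Strava counts predict the manual counts perfectly, while a value near 0 would mean they explain almost none of the variation between locations.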

Conclusion:

  1. The age and gender profile of Strava users is broadly similar to that of people who cycle to work (as reported in the 2011 Census), although men and the 35-44 age group are over-represented among Strava users.

  2. In terms of total numbers of cycling trips, there is a good association between Strava data and on-road count data (R-squared of 0.82 when Strava data are aggregated across a whole year).

Strava offers a powerful source of data about cycling activity at a detailed spatial and temporal level. Although some bias may exist within this sample when it is used to estimate actual counts of cyclists, its predictive power is better than we expected, particularly when the data are aggregated over longer periods. We plan to use this ‘big data’ in future work examining the impact of lighting on cycling rates and cyclist safety.