A team of Duke students spent the summer developing a more efficient way to map tobacco retailers in Durham County, which has more than 250, and in North Carolina, which has more than 15,000.

In the United States, there is no national database of tobacco retailers and no national licensing requirement for the sale of tobacco. Students in the Information Initiative at Duke's Data+ summer program spent 10 weeks figuring out how best to map Durham County’s tobacco retailers, in the hope that the map could one day be extended to include all tobacco retailers in the U.S.

“It’s important to know where exactly these tobacco retailers are since youth are more likely to begin smoking in areas with lots of tobacco retailers,” said sophomore Felicia Chen, a participant in the Data+ research program.

Chen said it’s important to see where tobacco retailers are located on a map because places with dense clusters of retailers tend to be neighborhoods with lower incomes. The map could therefore inform policy decisions.

Only 37 states in the U.S. require a license to sell tobacco, Chen noted. Many of those stores with licenses keep poor, handwritten or out-of-date records of their sales. The other 13 states, North Carolina among them, do not require special permits or licenses for the retail distribution of tobacco.

“This is pretty surprising since tobacco products consist of 36 percent of sales revenue in convenience stores, and obviously there’s major health effects from smoking,” Chen said.    

However, Chen said the lack of tobacco licensing laws probably stems from the considerable influence the tobacco industry holds in the United States because of its economic contributions.

The Data+ project was called “Open Data for Tobacco Retailer Mapping” and was sponsored by a North Carolina-based nonprofit called Counter Tools.

Counter Tools previously set out to create a database of tobacco retailers in Virginia through a process called "ground-truthing." That method took three years to produce the necessary data because it required driving down every road in Virginia and visiting every retailer to see whether it sold tobacco. The Data+ research team wanted to find a more efficient way to collect this data.

The team’s research consisted of three key steps. The first was web scraping: using R code and an automated bot, the team compiled a list of retailers in Durham by pulling data such as retailers’ addresses and phone numbers from Yellow Pages websites.
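For readers curious what a scraping step like this can look like, here is a minimal sketch. The team worked in R; this illustration uses Python's standard library instead. The sample page, the class names (`listing`, `name`, `address`, `phone`) and the businesses are all made up for the example, and real directory pages are laid out differently.

```python
from html.parser import HTMLParser

# Hypothetical markup and made-up businesses; a real directory page
# would use different tags and class names.
SAMPLE_PAGE = """
<div class="listing">
  <span class="name">Main Street Mini Mart</span>
  <span class="address">101 Main St, Durham, NC</span>
  <span class="phone">919-555-0101</span>
</div>
<div class="listing">
  <span class="name">Ninth Street Tobacco</span>
  <span class="address">202 Ninth St, Durham, NC</span>
  <span class="phone">919-555-0102</span>
</div>
"""

FIELDS = {"name", "address", "phone"}


class ListingParser(HTMLParser):
    """Collects one dict per <div class="listing"> block."""

    def __init__(self):
        super().__init__()
        self.listings = []
        self._field = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "div" and cls == "listing":
            self.listings.append({})  # start a new retailer record
        elif tag == "span" and cls in FIELDS and self.listings:
            self._field = cls

    def handle_data(self, data):
        # Only keep text that sits inside one of the known field spans.
        if self._field is not None:
            self.listings[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def parse_listings(html):
    parser = ListingParser()
    parser.feed(html)
    parser.close()
    return parser.listings


listings = parse_listings(SAMPLE_PAGE)
for row in listings:
    print(row["name"], "|", row["address"], "|", row["phone"])
```

A real scraper would also need to fetch the pages, follow pagination and respect the site's terms of use, none of which is shown here.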

Since not every business on the list necessarily sold tobacco, the researchers turned to machine learning for step two, testing whether an algorithm could predict if a retailer sold tobacco based purely on its name. They found that their algorithm correctly predicted tobacco retailers 85 percent of the time.
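The article does not say which algorithm the team used. As a rough illustration of how a name-only prediction can work, the sketch below trains a small naive Bayes classifier on the words in store names. Naive Bayes is a stand-in chosen for simplicity, and the training names and labels are invented for the example.

```python
import math
from collections import Counter


def tokenize(name):
    """Split a store name into lowercase words."""
    return name.lower().split()


def train(examples):
    """examples: list of (store_name, sells_tobacco) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for name, sells in examples:
        class_counts[sells] += 1
        word_counts[sells].update(tokenize(name))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, class_counts, vocab


def predict(model, name):
    """Return True if the name looks more like a tobacco retailer."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in (True, False):
        n_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total)  # class prior
        for word in tokenize(name):
            # Add-one smoothing so unseen words do not zero out a class.
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return scores[True] > scores[False]


def accuracy(model, labeled):
    """Share of labeled names the model gets right."""
    correct = sum(predict(model, name) == sells for name, sells in labeled)
    return correct / len(labeled)


# Invented training data, for illustration only.
TRAINING = [
    ("Quick Stop Food Mart", True),
    ("Ninth Street Tobacco", True),
    ("Smoke Shop Express", True),
    ("Corner Gas Mart", True),
    ("Durham Dental Care", False),
    ("Sunrise Yoga Studio", False),
    ("City Book Shop", False),
    ("Family Dental Studio", False),
]

model = train(TRAINING)
held_out = [("Main Tobacco Mart", True), ("Downtown Dental Studio", False)]
print(accuracy(model, held_out))
```

A figure like the team's 85 percent would come from running this kind of accuracy check on a much larger set of names whose true status is known and that the model did not see during training.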

Finally, the team checked its results by paying people a small fee, between 10 and 25 cents, to call the retailers and ask whether they sold tobacco.
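One way to compare phone answers with the algorithm's predictions is sketched below. It assumes that more than one caller may reach the same store and that disagreements are settled by majority vote; the article does not say whether the team did this. The store IDs and answers are placeholders.

```python
from collections import Counter


def majority_answer(responses):
    """Return True or False for the majority answer, or None on a tie or no answers."""
    counts = Counter(responses)
    if counts[True] == counts[False]:
        return None
    return counts[True] > counts[False]


def agreement_rate(predictions, resolved):
    """Share of stores where the prediction matches the resolved phone answer.

    Stores with no clear phone answer (None) are left out of the comparison.
    """
    pairs = [
        (predictions[store], answer)
        for store, answer in resolved.items()
        if answer is not None and store in predictions
    ]
    if not pairs:
        return None
    return sum(pred == answer for pred, answer in pairs) / len(pairs)


# Placeholder data: each store ID maps to the answers callers recorded.
calls = {
    "store_a": [True, True, False],
    "store_b": [False, False],
    "store_c": [True, False],  # tie, so no clear answer
}
predictions = {"store_a": True, "store_b": True, "store_c": True}

resolved = {store: majority_answer(answers) for store, answers in calls.items()}
print(agreement_rate(predictions, resolved))
```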

While the team scraped data for all of North Carolina, it focused closely on Durham. The team found 15,502 unique retailers in the state and 266 in Durham County. The project is ongoing, and the researchers said they hope to expand their Durham map to cover a larger geographic area.

“Because of the lack of regulation [of tobacco retailers], I think it’s especially important to know the locations [of them] so we can have better regulations,” Chen said.

Correction: This article was updated 12:00 p.m. Tuesday to reflect that the Data+ team found 15,502 retailers in North Carolina, not 15,503.