scout is a new way to browse New York City’s open data portal. Developed by Two Sigma Data Clinic in partnership with the NYC Mayor’s Office of Data Analytics, scout enhances data discoverability and collaboration by evaluating thematic similarity and joinability between datasets and by facilitating the creation of curated dataset collections.
How I Started
NYC’s Open Data Portal has nearly 4,000 datasets. As that number continues to grow, so does the challenge of finding the datasets you need, as well as those you didn’t know you needed for richer context. scout addresses this by providing user-friendly filters for the initial search and by suggesting thematically similar and joinable datasets alongside the search results.
How I Built This
Inspired by input from the Mayor’s Office of Data Analytics and Open Data Coordinators, scout is a web app powered by Socrata’s API, which provides metadata on all NYC datasets. A key focus of this tool is using data and machine learning techniques to surface datasets similar to a user’s search. We identify datasets that share common columns and evaluate how “joinable” they are using the data dictionaries and samples of the data. To evaluate thematic similarity, we identify key topics from the dataset descriptions with the help of natural language processing tools, which provide more nuanced results than keyword matching.
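To make the idea of “joinability” concrete, here is a minimal sketch of one way such a score could be computed. It is not scout’s actual implementation: it assumes each dataset is represented as a mapping from column name to a small sample of values, matches columns by normalized name, and scores each shared column by the Jaccard overlap of the sampled values. The function names and the sample data are hypothetical.

```python
from typing import Dict, List, Set


def normalize(name: str) -> str:
    """Lowercase a column name and unify separators so 'Zip Code' matches 'zip_code'."""
    return name.strip().lower().replace(" ", "_").replace("-", "_")


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of distinct values found in both samples (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def joinability(left: Dict[str, List[str]], right: Dict[str, List[str]]) -> Dict[str, float]:
    """Score each column name shared by two datasets by the overlap of sampled values.

    Each dataset is a mapping of column name -> sampled values (an assumed shape).
    """
    left_cols = {normalize(c): set(v) for c, v in left.items()}
    right_cols = {normalize(c): set(v) for c, v in right.items()}
    shared = left_cols.keys() & right_cols.keys()
    return {col: jaccard(left_cols[col], right_cols[col]) for col in sorted(shared)}


# Made-up samples for illustration only, not real portal data.
trees = {"Zip Code": ["10001", "10002", "10003"], "species": ["oak", "elm", "oak"]}
complaints = {"zip_code": ["10002", "10003", "10004"], "complaint_type": ["noise", "heat", "noise"]}

scores = joinability(trees, complaints)
print(scores)
```

In this toy example the two datasets share only a zip code column, and half of the distinct sampled values appear in both, so that column would be a plausible join key. A real scoring method would likely also weigh column types and the descriptions in the data dictionaries.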
As the data and tech philanthropic arm of Two Sigma, Data Clinic provides pro bono data science and engineering support to nonprofits and engages in open-source tooling and research that contribute to the Data for Good movement. We leverage Two Sigma’s people, data science skills, and technological know-how to support communities, mission-driven organizations, and the broader public in their efforts to use data more effectively.