Created by Two Sigma Data Clinic
Data Discovery, Data Management

How I Started

NYC's Open Data Portal has nearly 4,000 datasets. As that number continues to grow, so does the challenge of discovering the datasets you need, as well as those you didn’t know you needed for richer context. To address this, scout provides user-friendly filters for the initial search and suggests thematically similar and joinable datasets to enrich the search results.

How I Built This

Inspired by input from the Mayor’s Office of Data Analytics and Open Data Coordinators, scout is a web app powered by Socrata’s API, which provides metadata on all NYC datasets. A key focus of the tool is using data and machine learning techniques to surface datasets similar to a user’s search. We identify datasets that share common columns and evaluate how “joinable” they are using the data dictionaries and samples of the data. To evaluate thematic similarity, we identify key topics in the dataset descriptions with natural language processing tools, which give more nuanced results than keyword matching.
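
The joinability idea described above can be illustrated with a minimal sketch. This is not scout's actual implementation: the `DatasetSample` container, the `normalize` helper, the sample values, and the choice of Jaccard overlap as the score are all assumptions made for illustration. The sketch finds columns that two datasets share by name, then scores each shared column by how much their sampled values overlap.

```python
from dataclasses import dataclass


@dataclass
class DatasetSample:
    """Hypothetical container: a dataset name plus sampled values per column."""
    name: str
    columns: dict[str, list[str]]


def normalize(column_name: str) -> str:
    """Lowercase and underscore a column name so 'Borough' matches 'borough'."""
    return column_name.strip().lower().replace(" ", "_")


def value_overlap(values_a: list[str], values_b: list[str]) -> float:
    """Jaccard overlap of two value samples (assumed score, 0.0 to 1.0)."""
    set_a, set_b = set(values_a), set(values_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def joinability(a: DatasetSample, b: DatasetSample) -> dict[str, float]:
    """Score every column name the two datasets share by sampled value overlap."""
    by_name_a = {normalize(col): vals for col, vals in a.columns.items()}
    by_name_b = {normalize(col): vals for col, vals in b.columns.items()}
    shared = by_name_a.keys() & by_name_b.keys()
    return {col: value_overlap(by_name_a[col], by_name_b[col]) for col in shared}


# Made-up samples, purely for illustration.
trees = DatasetSample(
    name="street_trees",
    columns={
        "Borough": ["Bronx", "Queens", "Brooklyn"],
        "Species": ["oak", "maple", "ginkgo"],
    },
)
complaints = DatasetSample(
    name="service_complaints",
    columns={
        "borough": ["Queens", "Brooklyn", "Manhattan"],
        "Complaint Type": ["noise", "heat", "parking"],
    },
)

scores = joinability(trees, complaints)
print(scores)  # {'borough': 0.5}
```

Here the two samples share only the `borough` column, and two of the four distinct borough values appear in both, giving a score of 0.5. A real system would likely also compare column types and draw on the data dictionaries, as the description above suggests.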

Two Sigma Data Clinic @tsdataclinic

As the data and tech philanthropic arm of Two Sigma, Data Clinic provides pro bono data science and engineering support to nonprofits and engages in open-source tooling and research that contribute to the Data for Good movement. We leverage Two Sigma’s people, data science skills, and technological know-how to support communities, mission-driven organizations, and the broader public in their efforts to use data more effectively.