Research Data

Strategies for Finding Statistics and Data

Step 1: Assess your data needs

Build your awareness of the potentials and challenges!

 Topic

What are your research questions and potential variables? For abstract concepts (e.g. happiness, economic freedom), look for an index (e.g. happiness index) for variables to consider.

Geography

Do you need country-level, national, or subnational (e.g. state, county, ZIP code) data? Subnational data may not always be available. Monetary cross-country comparisons (e.g. real GDP) may need to be adjusted for purchasing power parity or market-based exchange rates.

Time Period

Do you need data for a particular period in time? Historic data may have gaps. Collection methods can change over time (e.g. CPI). Monetary time series may need to be adjusted for inflation. A time lag between data collection and its release is typical.
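As a rough illustration of the inflation adjustment mentioned above, a nominal value can be restated in base-year prices by multiplying it by the ratio of the base-year price index to that year's price index. The sketch below assumes a simple dictionary layout, and all figures are hypothetical values chosen for easy arithmetic, not real statistics. The function name `to_real` is also an assumption made for this example.

```python
def to_real(nominal, price_index, base_year):
    """Restate nominal values in base-year prices.

    real = nominal * (index in base year / index in that year)
    """
    base = price_index[base_year]
    return {year: value * base / price_index[year]
            for year, value in nominal.items()}


# Hypothetical figures for illustration only (not real statistics).
nominal_income = {2000: 40000.0, 2010: 50000.0, 2020: 60000.0}
price_index = {2000: 80.0, 2010: 100.0, 2020: 120.0}

real_income = to_real(nominal_income, price_index, base_year=2020)
for year in sorted(real_income):
    print(year, round(real_income[year], 2))
```

In this made-up series the nominal income rises every decade, but after adjustment every year comes out to the same amount in 2020 prices, which is the kind of pattern an unadjusted series would hide.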

Frequency

Do you need quarterly, monthly, or annual data? Some data may be available in daily increments (e.g. stock prices). Some data are collected only every 5 years (e.g. Economic Census). The frequency you expect may not be available. For data collected multiple times a year, seasonally adjusted data (e.g. retail, air travel) may be needed.

Granularity

Do you need microdata (with unit-level data/individual responses) or summary data (e.g. data table)? Microdata may have restrictions in availability and use. Publicly available microdata may only be available through specific sources/repositories.

Method

Would your data have been collected via a survey/interview (e.g. public opinion), direct tracking (e.g. POS scanner), administrative reporting (e.g. crime incidents), etc.? Consider consistency and comparability when merging datasets. Collecting data is a costly effort--asking why it was collected is important for evaluating its quality!

Step 2: Ask who cares about the data

Understand layers of data sources by varied stakeholders and their pros & cons!

Government Agencies

(e.g. U.S. Census Bureau, Bureau of Labor Statistics) collect data via various surveys and release it as data tables, data files, data portals, or reports. Since the data sample, categories, and definitions may be different from your understanding, read the data collection methodology very carefully.

Researchers at Academic/Research Institutions/Think Tanks

(e.g. Harvard University, Urban Institute, Pew Research Center) conduct research, collect data, and publish reports, working papers, or journal articles. Original data may not be available to the public and may be tied to specific research agendas that may be different from your own.

International Organizations

(e.g. World Bank, IMF, WTO) collect statistics and data from member countries and often share them for free. Their working papers/reports are often more timely and their methodologies are more detailed than journal articles. Data quality depends on member countries' data practices and the quality of the organization's assessment or evaluation frameworks.

Nonprofit Organizations

(e.g. Kaiser Family Foundation, GuideStar, Kauffman Foundation) invest in their mission-related data collection. Their data are valuable in filling some gaps in current government data and may be totally/partially free. However, be critical of potential biases that promote or further their interests.

Trade/Industry Associations

(e.g. Hospital Association, Risk Management Association) may collect data from their members. Factsheets and short reports are often free, but detailed data are often not free. Data may not come from random sampling, so they may not be statistically reliable. Be aware of potential biases of the data towards the association's interests.

Data Archive/Repository

(e.g. ICPSR, Roper iPoll, UK Data Service) provide easy access to research data. Free self-archiving repositories (e.g. OpenICPSR, Harvard Dataverse, GitHub, Kaggle) often do not appraise dataset quality. Datasets may have privacy, confidentiality, or copyright violations, incomplete metadata, or missing documentation, or the format you need may not be available.

Private Data Vendors

They compile data into a database (e.g. Bloomberg, Statista, Data-Planet, IRI) and make scattered data more available and accessible. Databases are often expensive, and the data can still be contaminated by missing values, errors, inconsistencies, and standardization, rounding, or selection bias. Use them as a pointer to find original data and make sure to verify data accuracy.

Libraries

(e.g. AOK Library at UMBC) provide some paid statistical databases from data vendors or archives. Librarians create library guides to help users find statistics and data and develop data literacy. Library guides are a helpful tool to find many free and paid data sources, but may be incomplete or not up-to-date, so always check guides from different libraries for publicly available datasets.

Step 3: Search through different paths

Being flexible and persistent is the key to success!

Literature

Find scholarly articles or working papers via Google Scholar or library databases to understand your topic and variables.

Data Aggregators

Databases such as Statistical Insight are a good place to start to find pointers to original data sources.

Library Guides

Search library guides on your data topic. This will save you a lot of time, since these pages list multiple sources in one place. Consult several guides to build your own data source list.

Online Searches

Google Dataset Search and Google Advanced Search can help locate specific data.

Data Portals

Search data portals (e.g. Explore Census Data, the Maryland Open Data Portal). Their embedded data may not be accessible via a Google search.

Microdata

Use data repositories (e.g. ICPSR, IPUMS) or search “microdata files online”.

Restricted Data

(e.g. CDC Vital Statistics County Level Data) doesn’t mean inaccessible. It can be obtained through a request-approval process.

Ask for Help

Many people are here to help you: librarians, statistical agency staff, and repository data experts. Don’t hesitate to ask.

This content was created by Grace Liu, Business Librarian, West Chester University, and is licensed under a Creative Commons CC BY license.

Tips for Googling for Data

  • Put phrases in quotation marks to search for the exact phrase. For example, search Google for “heart attack” to find information on heart attacks.

  • Use OR to search for all possible synonyms. For example, searching Google for “heart attack” OR “myocardial infarction” also finds information that uses the medical term for heart attacks. Hint: you can use your favorite AI tool to find synonyms.

  • To find data or statistics on a topic, put your synonyms for the topic in parentheses, and add all possible synonyms for data in another set of parentheses. For example, searching Google for ("heart attack" OR "myocardial infarction") (data OR dataset OR stats OR statistics) finds data and statistics on heart attacks.

  • Limit to a region by going to “Tools” then “Advanced search.” Scroll down to the “narrow your results by” section, click the dropdown by “region,” and select the region that you’re interested in.

  • Limit your results to a particular website or set of websites by adding the site operator, “site:”, followed by the site or sites you want data from. For example, searching Google for ("heart attack" OR "myocardial infarction") (data OR dataset OR stats OR statistics) site:.gov finds only government data on heart attacks.

  • Remove a website from your search results by using the “-site:” operator. For example, searching Google for ("heart attack" OR "myocardial infarction") (data OR dataset OR stats OR statistics) site:.gov -site:cdc.gov finds data on heart attacks on all government websites except the CDC’s.

  • Limit your results to just the file types you’re interested in by using the “filetype:” operator followed by the file type or types that you’re interested in. For example, searching Google for ("heart attack" OR "myocardial infarction") (filetype:xls OR filetype:xlsx OR filetype:csv) finds just those file types.
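If you build these searches often, the tips above can be combined in a small script. The following is only a sketch: the helper names `or_group` and `build_query` are hypothetical and exist just for this example, and the script does nothing more than assemble the same query strings shown in the bullets and turn them into a Google search link.

```python
from urllib.parse import urlencode


def or_group(terms):
    """Join terms with OR inside parentheses, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(topic_terms, data_terms=(), filetypes=(),
                site=None, exclude_site=None):
    """Assemble a search string from synonyms and optional operators."""
    parts = [or_group(topic_terms)]
    if data_terms:
        parts.append(or_group(data_terms))
    if filetypes:
        # One filetype: operator per file type, combined with OR.
        parts.append("(" + " OR ".join(f"filetype:{ft}" for ft in filetypes) + ")")
    if site:
        parts.append(f"site:{site}")
    if exclude_site:
        parts.append(f"-site:{exclude_site}")
    return " ".join(parts)


query = build_query(
    ["heart attack", "myocardial infarction"],
    data_terms=["data", "dataset", "stats", "statistics"],
    site=".gov",
    exclude_site="cdc.gov",
)
print(query)

# URL-encode the query so it can be pasted into a browser.
search_url = "https://www.google.com/search?" + urlencode({"q": query})
print(search_url)
```

Swapping in your own synonym lists, or passing `filetypes=["xls", "xlsx", "csv"]`, produces the other query patterns from the list above.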

Tools for Finding Data and Data Repositories

 

  • re3data.org An extensive list of discipline-specific repositories.
  • Repository Finder A new tool recently launched by DataCite for helping people identify and locate online repositories of research data. Draws from the re3data listings for repository information. 

 

Data Repositories by Discipline

You can also deposit prepared data in ICPSR to share your data.

 

  • Qualitative Data Repository.
    QDR is a dedicated repository for preserving and sharing the digital assets associated with social science and mixed methods projects. It was founded with support from the National Science Foundation and the Center for Qualitative and Multi-Method Inquiry, a unit of the Maxwell School of Citizenship and Public Affairs at Syracuse University.

 

 

 

  • National Neighborhood Data Archive (NaNDA).
    The National Neighborhood Data Archive (NaNDA) is a publicly available data archive containing measures of the physical, economic, demographic, and social environment at multiple levels of spatial scale (e.g. census tract, ZIP code tabulation area, county). Each NaNDA dataset covers all or most of the nation (including both rural and urban areas) and represents a set of measures on a single topic of interest, including socioeconomic disadvantage, healthcare, housing, partisanship, and public transit, with temporal coverage dating back to 2000.

 

Astronomy

Biology

  • The Cell: An Image Library
    Images of all cell types from all organisms, including intracellular structures and movies or animations demonstrating functions. This project relies upon the cell biology community to populate the library. Freely accessible, easy-to-search, public repository of reviewed and annotated images, videos, and animations of cells from a variety of organisms, showcasing cell architecture, intracellular functionalities, and both normal and abnormal processes.

  • Morphbank
    Holds biological imaging documents from a wide variety of research, including specimen-based research in comparative anatomy, morphological phylogenetics, taxonomy, and related fields focused on increasing our knowledge about biodiversity. The project receives its main funding from the Biological Databases and Informatics program of the National Science Foundation (Grant DBI-0446224).

  • PaleoBiology Database
    "We are bringing together taxonomic and distributional information about the entire fossil record of plants and animals from a large number of researchers at a large number of institutions."

Chemistry

Computer Science

  • GitHub
    Keeps your public and private code available, secure, and backed up.

  • SourceForge
    2.7 million developers create powerful software in over 260,000 projects. Our popular directory connects more than 46 million consumers with these open source projects and serves more than 2,000,000 downloads a day. SourceForge is where open source happens.

  • SNAP
    Stanford Large Network Dataset Collection. The SNAP library has been actively developed since 2004 and is organically growing as a result of our research pursuits in analysis of large social and information networks. The largest network we analyzed so far using the library was the Microsoft Instant Messenger network from 2006 with 240 million nodes and 1.3 billion edges.

Energy

Environmental Sciences

  • The Marine Geoscience Data System (MGDS)
    The Marine Geoscience Data System (MGDS) provides access to data portals for the NSF-supported Ridge 2000 and MARGINS programs, the Antarctic and Southern Ocean Data Synthesis, the Global Multi-Resolution Topography Synthesis, and the Seismic Reflection Field Data Portal.

Geology

  • IRIS (Incorporated Research Institutions for Seismology).
    From 100+ US universities and the National Science Foundation.

Geosciences & Geospatial Data

  • EarthChem
    Holds data systems and services for geochemical, geochronological, and petrological data, developed and maintained by EarthChem, including the EarthChem Library, the EarthChem Portal, PetDB, NAVDAT, SedDB, and Geochron. EarthChem is operated by a joint team of disciplinary scientists, data scientists, data managers and information technology developers who are part of the NSF-funded data facility Integrated Earth Data Applications (IEDA).

  • The Geosciences Network (GEON)
    This project is a collaboration among a dozen PI institutions and a number of other partner projects, institutions, and agencies to develop cyberinfrastructure in support of an environment for integrative geoscience research. GEON is funded by the NSF Information Technology Research (ITR) program.

  • The National Space Science Data Center
    This serves as the permanent archive for NASA space science mission data. "Space science" means astronomy and astrophysics, solar and space plasma physics, and planetary and lunar science. As the permanent archive, NSSDC teams with NASA's discipline-specific space science "active archives," which provide access to data to researchers and, in some cases, to the general public.

Medicine

  • All of Us Research Hub
    The Research Hub houses one of the largest, most diverse, and most broadly accessible datasets ever assembled. It also provides an interactive Data Browser where anyone can learn about the type and quantity of data that All of Us collects. Users can explore aggregate data including genomic variants, survey responses, physical measurements, electronic health record information, and wearables data.

  • MIRAGE (Middlesex medical Image Repository with a CBIR ArchivinG Environment).
    From JISC and Middlesex University.

Physics

  • NIST Atomic Spectra Database
    The Atomic Spectra Database (ASD) contains data for radiative transitions and energy levels in atoms and atomic ions. Data are included for observed transitions of 99 elements and energy levels of 56 elements.

 

  • CORE Repository (MLA)
    A service offered as part of the MLA Commons, the Commons Open Repository Exchange offers a place to store and publish digital assets and data in the humanities.

 

  • HumanitiesCommons
    Humanities Commons is a repository for the humanities. Discover the latest open-access scholarship and teaching materials, make interdisciplinary connections, build a WordPress Web site, and increase the impact of your work by sharing it in the repository.
  • DataONE 
    An international federation of data repositories containing earth observations data, including data from fields such as ecology, biology, evolution, and environmental sciences such as hydrology, oceanography, and atmospheric science. DataONE is a federation with participation from hundreds of field stations, universities, and government agencies through the DataONE Member Nodes.

  • Dryad 
    An international repository of data underlying scientific and medical publications, particularly data for which no specialized repository exists. All material in Dryad is associated with a scholarly publication. Most data in the repository are associated with peer-reviewed articles, although data associated with non-peer reviewed publications from reputable academic sources, such as dissertations, are also accepted. Dryad is a non-profit organization.