Strategies for Finding Statistics and Data

Step 1: Assess your data needs

Build your awareness of the potential and challenges!

Topic

What are your research questions and potential variables? For abstract concepts (e.g. happiness, economic freedom), look for an index (e.g. happiness index) to identify variables to consider.

Geography

Do you need country-level, national, or subnational (e.g. state, county, zip code) data? Subnational data may not always be available. Cross-country monetary comparisons (e.g. real GDP) may need to be adjusted for purchasing power parity or market-based exchange rates.
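
As a rough illustration of the cross-country adjustment described above, the sketch below converts a local-currency figure to international dollars with a PPP conversion factor and to U.S. dollars with a market exchange rate. The function name and all numbers are hypothetical placeholders, not real statistics; real conversion factors would come from a source such as the World Bank or IMF.

```python
def convert_local_currency(value_local, units_per_dollar):
    """Convert a local-currency amount using a rate expressed as
    local currency units per (international or U.S.) dollar."""
    if units_per_dollar <= 0:
        raise ValueError("conversion rate must be positive")
    return value_local / units_per_dollar

# Hypothetical example values (NOT real data).
gdp_local = 1_000_000.0   # GDP in local currency units
ppp_factor = 20.0         # local units per international dollar (PPP)
market_rate = 80.0        # local units per U.S. dollar (market exchange rate)

gdp_ppp = convert_local_currency(gdp_local, ppp_factor)
gdp_market = convert_local_currency(gdp_local, market_rate)

print(f"PPP-adjusted: {gdp_ppp:,.0f} international dollars")
print(f"Market rate:  {gdp_market:,.0f} U.S. dollars")
```

The two results can differ substantially, so it helps to check which conversion a data source used before comparing countries.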

Time Period

Do you need data for a particular period in time? Historic data may have gaps. Collection methods can change over time (e.g. CPI). Monetary time series may need to be adjusted for inflation. A time lag between data collection and its release is typical.
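
The inflation adjustment mentioned above can be sketched as follows, assuming you have a price index (such as a CPI series) for each year. A nominal value is restated in base-year dollars by multiplying it by the ratio of the base-year index to that year's index. The function name, years, and index values are made up for illustration and are not real CPI figures.

```python
def to_real(nominal, index_value, base_index_value):
    """Restate a nominal amount in base-period prices using a price index."""
    return nominal * base_index_value / index_value

# Hypothetical nominal amounts and price index values (NOT real data).
nominal_by_year = {2000: 100.0, 2010: 100.0, 2020: 100.0}
index_by_year = {2000: 80.0, 2010: 100.0, 2020: 125.0}
base_year = 2020

real_by_year = {
    year: to_real(amount, index_by_year[year], index_by_year[base_year])
    for year, amount in nominal_by_year.items()
}

for year, amount in real_by_year.items():
    print(f"{year}: {amount:.2f} in {base_year} dollars")
```

Whichever base year you choose, use the same one across every series you plan to compare.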

Frequency

Do you need quarterly, monthly, or annual data? Some data may be available in daily increments (e.g. stock prices). Some data are collected only every 5 years (e.g. Economic Census). The frequency you expect may not be available. For data collected multiple times a year (e.g. retail, air travel), seasonally adjusted data may be needed.

Granularity

Do you need microdata (unit-level data or individual responses) or summary data (e.g. data tables)? Microdata may have restrictions on availability and use. Publicly available microdata may only be accessible through specific sources or repositories.

Method

Would your data have been collected via a survey/interview (e.g. public opinion), direct tracking (e.g. POS scanner), administrative reporting (e.g. crime incidents), etc.? Consider consistency and comparability when merging datasets. Collecting data is a costly effort--asking why it was collected is important for evaluating its quality!

Step 2: Ask who cares about the data

Understand layers of data sources by varied stakeholders and their pros & cons!

Government Agencies

(e.g. U.S. Census Bureau, Bureau of Labor Statistics) collect data via various surveys and release it as data tables, data files, data portals, or reports. Since the data sample, categories, and definitions may differ from your understanding, read the data collection methodology very carefully.

Researchers at Academic/Research Institutions/Think Tanks

(e.g. Harvard University, Urban Institute, Pew Research Center) conduct research, collect data, and publish reports, working papers, or journal articles. Original data may not be available to the public and may be tied to specific research agendas that differ from your own.

International Organizations

(e.g. World Bank, IMF, WTO) collect statistics and data from member countries and often share them for free. Their working papers/reports are often more timely and their methodologies are more detailed than journal articles. Data quality depends on member countries' data practices and the quality of the organization's assessment or evaluation frameworks.

Nonprofit Organizations

(e.g. Kaiser Family Foundation, GuideStar, Kauffman Foundation) invest in their mission-related data collection. Their data are valuable in filling some gaps in current government data and may be totally or partially free. However, be critical of potential biases that promote or further their interests.

Trade/Industry Associations

(e.g. Hospital Association, Risk Management Association) may collect data from their members. Factsheets and short reports are often free, but detailed data often are not. Data may not come from random sampling, so they may not be statistically reliable. Be aware of potential biases in the data towards the association's interests.

Data Archive/Repository

(e.g. ICPSR, Roper iPoll, UK Data Service) provide easy access to research data. Free self-archiving repositories (e.g. openICPSR, Harvard Dataverse, GitHub, Kaggle) often do not appraise dataset quality. Their datasets may have privacy, confidentiality, or copyright violations, incomplete metadata, or missing documentation, or may not be available in the format you need.

Private Data Vendors

They compile data into a database (e.g. Bloomberg, Statista, Data Planet, IRI) and make scattered data more available and accessible. Databases are often expensive, and the data can still be contaminated by missing values, errors, inconsistencies, and standardization, rounding, or selection bias. Use a vendor database as a pointer to the original data and make sure to verify data accuracy.

Libraries

(e.g. AOK Library at UMBC) provide some paid statistical databases from data vendors or archives. Librarians create library guides to help users find statistics and data and develop data literacy. Library guides are a helpful tool for finding many free and paid data sources, but they may be incomplete or out of date, so always check guides from different libraries for publicly available datasets.

Step 3: Search through different paths

Being flexible and persistent is the key to success!

Literature

Find scholarly articles or working papers via Google Scholar or library databases to understand your topic and variables.

Data Aggregators

Databases such as Statistical Insight are a good place to start to find pointers to original data sources.

Library Guides

Search library guides on your data topic. This will save you a lot of time, since these pages list multiple sources in one place. Consult several guides to build your own data source list.

Online Searches

Google Dataset Search and Google Advanced Search can help you locate specific data.

Data Portals

Search data portals (e.g. Explore Census Data, the Maryland Open Data Portal). Their embedded data may not be accessible via a Google search.

Microdata

Use data repositories (such as ICPSR, IPUMS, etc.) or search “microdata files online”.

Restricted Data

(e.g. CDC Vital Statistics County Level Data) doesn’t mean inaccessible. Access can be acquired through a request-approval process.

Ask for Help

Many people are here to help you: librarians, statistical agency staff, and repository data experts. Don’t hesitate to ask.

This content was created by Grace Liu, Business Librarian, West Chester University and is on a Creative Commons CC BY License.

Tips for Googling for Data

Put phrases in quotation marks to search for the exact phrase, for example you can google search “heart attack” to find info on heart attacks.

Use OR to search for all possible synonyms, for example, googling for “heart attack” OR “myocardial infarction” to also find info that uses the medical term for heart attacks. Hint: you can use your favorite AI tool to find synonyms.

To find data or statistics on a topic, put your synonyms for the topic in parentheses, and add all possible synonyms for data in another set of parentheses, for example, google searching for ("heart attack" OR "myocardial infarction") (data OR dataset OR stats OR statistics) finds data and statistics on heart attacks.

Limit to a region by going to “Tools” then “Advanced search.” Scroll down to the “narrow your results by” section, click the dropdown by “region,” and select the region that you’re interested in.

Limit your results to a particular website or set of websites by adding the site operator, “site:” followed by the site or sites you want data from, for example, google searching ("heart attack" OR "myocardial infarction") (data OR dataset OR stats OR statistics) site:.gov finds only government data on heart attacks.

Remove a website from your search results by using the “-site:” operator, for example, googling ("heart attack" OR "myocardial infarction") (data OR dataset OR stats OR statistics) site:.gov -site:https://www.cdc.gov finds data on heart attacks on all government websites except the CDC’s.

Limit your results to just the file types you’re interested in by using the “filetype:” operator followed by the file type or types that you’re interested in, for example, googling ("heart attack" OR "myocardial infarction") (filetype:xls OR filetype:xlsx OR filetype:csv) finds just those file types.
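
The tips above can be combined programmatically. The following is a minimal sketch, assuming a hypothetical helper named build_data_query (not part of any Google tool or library), that assembles a search string from topic synonyms, data synonyms, and the site:, -site:, and filetype: operators described above. It only builds the text you would paste into the search box.

```python
def build_data_query(topic_terms,
                     data_terms=("data", "dataset", "stats", "statistics"),
                     site=None, exclude_sites=(), filetypes=()):
    """Assemble a Google-style query string (hypothetical helper)."""

    def quote(term):
        # Multi-word phrases go in quotation marks for exact matching.
        return f'"{term}"' if " " in term else term

    def group(terms):
        # Synonyms are joined with OR inside one set of parentheses.
        return "(" + " OR ".join(terms) + ")"

    parts = [group([quote(t) for t in topic_terms])]
    if data_terms:
        parts.append(group(list(data_terms)))
    if filetypes:
        parts.append(group([f"filetype:{ft}" for ft in filetypes]))
    if site:
        parts.append(f"site:{site}")
    parts.extend(f"-site:{s}" for s in exclude_sites)
    return " ".join(parts)


topics = ["heart attack", "myocardial infarction"]

# Government sites only, excluding the CDC.
gov_query = build_data_query(topics, site=".gov",
                             exclude_sites=["https://www.cdc.gov"])

# Spreadsheet-style files only, without the data synonyms.
file_query = build_data_query(topics, data_terms=(),
                              filetypes=["xls", "xlsx", "csv"])

print(gov_query)
print(file_query)
```

Keeping the synonym lists as separate arguments makes it easy to swap in a new topic while reusing the same data terms and site filters.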

Tools for Finding Data and Data Repositories

Google Dataset Search

re3data.org An extensive list of discipline-specific repositories.

Repository Finder A new tool recently launched by DataCite for helping people identify and locate online repositories of research data. Draws from the re3data listings for repository information.

Open Access Directory's Data Repositories Wiki A list of repositories and databases for locating and depositing open data.

Data Repositories by Discipline

Inter-University Consortium for Political and Social Research (ICPSR)
An international consortium of about 700 academic institutions and research organizations, ICPSR provides leadership and training in data access, curation, and methods of analysis for the social science research community. ICPSR maintains a data archive of more than 500,000 files of research in the social sciences. UMBC is a member, giving all authorized library users free full access. Be sure to use your UMBC email address when creating an account so that you are authenticated as a member. Here is more of what ICPSR offers you as a member:

You can also deposit prepared data in ICPSR to share your data.

Qualitative Data Repository.
QDR is a dedicated repository for preserving and sharing the digital assets associated with social science and mixed methods projects. It was founded with support from the National Science Foundation and the Center for Qualitative and Multi-Method Inquiry, a unit of the Maxwell School of Citizenship and Public Affairs at Syracuse University.

Australian Social Science Data Archive.
From the Australian Demographic and Social Research Institute at the Australian National University.

CESSDA Data Portal.
From the Council of European Social Science Data Archives (CESSDA).

National Neighborhood Data Archive (NaNDA).
The National Neighborhood Data Archive (NaNDA) is a publicly available data archive containing measures of the physical, economic, demographic, and social environment at multiple levels of spatial scale (e.g., census tract, ZIP code tabulation area, county). Each NaNDA dataset covers all or most of the nation (including both rural and urban areas) and represents a set of measures on a single topic of interest, including socioeconomic disadvantage, healthcare, housing, partisanship, and public transit, with temporal coverage dating back to 2000.

Digital Repositories E-Science Network (DReSNeT).
From the UK Engineering & Physical Sciences Research Council (EPSRC). A network of social science repositories for texts and data.

Astronomy

Astronomical Data Archives Center
From the National Astronomical Observatory of Japan.

Astrophysics Data System
From the Smithsonian Astrophysical Observatory (SAO) and National Aeronautics and Space Administration (NASA).

National Space Science Data Center
From the US National Aeronautics and Space Administration (NASA).

Biology

The Cell: An Image Library
Images of all cell types from all organisms, including intracellular structures and movies or animations demonstrating functions. This project relies upon the cell biology community to populate the library. Freely accessible, easy-to-search, public repository of reviewed and annotated images, videos, and animations of cells from a variety of organisms, showcasing cell architecture, intracellular functionalities, and both normal and abnormal processes.

DataBasin
OA data in conservation. From the Conservation Biology Institute in partnership with Rhiza Labs.

GenBank
The NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research, 2013 Jan;41(D1):D36-42). GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI.

Global Biodiversity Information Facility (GBIF)
"Free and open access to biodiversity data." Launched in 2007 by institutions in 17 countries under a non-binding inter-governmental agreement.

MorphoBank
"Homology of phenotypes over the web." Hosted by the State University of New York at Stony Brook.

Morphbank
Holds biological imaging documents from a wide variety of research, including specimen-based research in comparative anatomy, morphological phylogenetics, taxonomy, and related fields focused on increasing our knowledge about biodiversity. The project receives its main funding from the Biological Databases and Informatics program of the National Science Foundation (Grant DBI-0446224).

PaleoBiology Database
"We are bringing together taxonomic and distributional information about the entire fossil record of plants and animals from a large number of researchers at a large number of institutions."

TreeBASE
"A Database of Phylogenetic Knowledge." Released in March 2010 based on a prototype launched in 1994. Hosted by the Phyloinformatics Research Foundation.

Chemistry

The Cambridge Crystallographic Data Centre (CCDC)
The CCDC is a non-profit, charitable institution whose objectives are the general advancement and promotion of the science of chemistry and crystallography for the public benefit.

Crystallography Open Database
A joint project of the Mineralogical Society of America, Mineralogical Association of Canada, European Journal of Mineralogy, International Union of Crystallography, and the US National Science Foundation. Data are in the public domain.

PubChem
From the U.S. National Center for Biotechnology Information of the National Institutes of Health (NIH).

ZINC
"A free database of commercially-available compounds for virtual screening." From the Shoichet Laboratory in the Department of Pharmaceutical Chemistry at the University of California, San Francisco.

Computer Science

GitHub
Keeps your public and private code available, secure, and backed up.

SourceForge
2.7 million developers create powerful software in over 260,000 projects. Our popular directory connects more than 46 million consumers with these open source projects and serves more than 2,000,000 downloads a day. SourceForge is where open source happens.

SNAP
Stanford Large Network Dataset Collection. The SNAP library has been actively developed since 2004 and is organically growing as a result of our research pursuits in analysis of large social and information networks. The largest network we have analyzed so far using the library was the Microsoft Instant Messenger network from 2006, with 240 million nodes and 1.3 billion edges.

Energy

DOE Data Explorer
From the US Department of Energy (DOE). Data generated by DOE-sponsored research.

OpenEI: Open Energy Information
Freely-available energy data, tools, models, and other resources.

Environmental Sciences

Climate Change Data Portal
From the Environment Department of the World Bank.

The Marine Geoscience Data System (MGDS)
The Marine Geoscience Data System (MGDS) provides access to data portals for the NSF-supported Ridge 2000 and MARGINS programs, the Antarctic and Southern Ocean Data Synthesis, the Global Multi-Resolution Topography Synthesis, and Seismic Reflection Field Data Portal.

National Ecological Observatory Network (NEON)
A joint project of 50+ US universities and laboratories.

Geology

GSA Data Repository
From the Geological Society of America.

IRIS (Incorporated Research Institutions for Seismology).
From 100+ US universities and the National Science Foundation.

Geosciences & Geospatial Data

EarthChem
Holds data systems and services for geochemical, geochronological, and petrological data, developed and maintained by EarthChem, including the EarthChem Library, the EarthChem Portal, PetDB, NAVDAT, SedDB, and Geochron. EarthChem is operated by a joint team of disciplinary scientists, data scientists, data managers and information technology developers who are part of the NSF-funded data facility Integrated Earth Data Applications (IEDA).

Geodata Repository
From the Open Source Geospatial Foundation.

The Geosciences Network (GEON)
This project is a collaboration among a dozen PI institutions and a number of other partner projects, institutions, and agencies to develop cyberinfrastructure in support of an environment for integrative geoscience research. GEON is funded by the NSF Information Technology Research (ITR) program.

National Geographic Data Center
An archive of national and international marine environmental and ecosystem datasets.

The National Space Science Data Center
This serves as the permanent archive for NASA space science mission data. "Space science" means astronomy and astrophysics, solar and space plasma physics, and planetary and lunar science. As the permanent archive, NSSDC teams with NASA's discipline-specific space science "active archives," which provide access to data to researchers and, in some cases, to the general public.

Medicine

All of Us Research Hub
The Research Hub houses one of the largest, most diverse, and most broadly accessible datasets ever assembled. It also provides an interactive Data Browser where anyone can learn about the type and quantity of data that All of Us collects. Users can explore aggregate data including genomic variants, survey responses, physical measurements, electronic health record information, and wearables data.

Gene Expression Omnibus
From the U.S. National Center for Biotechnology Information of the National Institutes of Health.

MIRAGE (Middlesex medical Image Repository with a CBIR ArchivinG Environment).
From JISC and Middlesex University.

National Center for Biotechnology Information (NCBI)
The National Center for Biotechnology Information advances science and health by providing access to biomedical and genomic information.

NeuroMorpho
Neuronal morphology data. From the Krasnow Institute for Advanced Study at George Mason University.

Virginia Henderson Global Nursing e-Repository
Nursing research data.

Physics

Blue Obelisk Data Repository
Repository of isotope masses, under MIT license. From the Blue Obelisk. Described in 10.1021/ci050400b.

CERN Scientific Information
Online particle physics data and information

NIST Atomic Spectra Database
The Atomic Spectra Database (ASD) contains data for radiative transitions and energy levels in atoms and atomic ions. Data are included for observed transitions of 99 elements and energy levels of 56 elements.

Historical Statistics of the United States (UMBC Subscribes)

CORE Repository (MLA)
A service offered as part of the MLA Commons, the Commons Open Repository Exchange offers a place to store and publish digital assets and data in the humanities.

HumanitiesCommons
Humanities Commons is a repository for the humanities. Discover the latest open-access scholarship and teaching materials, make interdisciplinary connections, build a WordPress Web site, and increase the impact of your work by sharing it in the repository.

DataONE
An international federation of data repositories containing earth observations data, including data from fields such as ecology, biology, evolution, and environmental sciences such as hydrology, oceanography, and atmospheric science. DataONE is a federation with participation from hundreds of field stations, universities, and government agencies through the DataONE Member Nodes.

Dryad
An international repository of data underlying scientific and medical publications, particularly data for which no specialized repository exists. All material in Dryad is associated with a scholarly publication. Most data in the repository are associated with peer-reviewed articles, although data associated with non-peer reviewed publications from reputable academic sources, such as dissertations, are also accepted. Dryad is a non-profit organization.

Entrez databases
A directory of chemical, biochemical, biomedical, and medical databases from the U.S. National Center for Biotechnology Information of the National Institutes of Health.

FigShare
FigShare allows you to share all of your data, negative results and unpublished figures.

KNB
The Knowledge Network for Biocomplexity (KNB) is an international data repository containing ecology, biology, and environmental science data with a global distribution. The KNB is a grass-roots partnership of collaborating field stations, laboratories, and research networks that openly publish and share data. The KNB is a Member Node within the DataONE data federation.

PANGAEA
Stands for "Publishing Network for Geoscientific & Environmental Data". Open to deposits from any scientist. Most datasets are open; some are restricted. Hosted by the Alfred Wegener Institute for Polar and Marine Research and the University of Bremen's Center for Marine Environmental Sciences.

Public Data Sets on AWS from Amazon Web Services.
The site already hosts OA datasets in biology, chemistry, and economics, and is willing to host them in any field.

ProQuest Statistical Insight (UMBC subscribes)
