
Research Data

Introduction to Data Integration, Analysis, and Visualization

Data integration, analysis, and visualization are carried out with mathematical and statistical software and programming tools. UMBC purchases a number of commercial tools for students, faculty, and staff to use, and there are also a number of powerful open source (free) tools. For the UMBC-purchased tools, we've provided a link to DoIT's information on the tool, including a description of its capabilities, installation instructions, and in some instances, information on how to get started with the tool. For the free tools, we've provided more in-depth information along with links to documentation and courses on using the tool.

A great UMBC resource is The Center for Interdisciplinary Research and Consulting (CIRC), which is a consulting service for mathematics and statistics provided by the Department of Mathematics and Statistics. It offers workshops on R, MATLAB, Stata, SPSS, and other tools.

Software Available to the UMBC Community

Open Source Software

Tableau Public is a free, limited version of Tableau, a business intelligence application. With Tableau Public you can easily create visualizations, dashboards, and data stories, but everything you publish is public, so it's not suitable for any private or confidential data.

 

Operating Systems Needed and Installation

Tableau Public is available for Windows and macOS. Download it from the Tableau Public site. When the download is complete, run the Tableau Public installer and follow the prompts.

Documentation, Classes, and other Sources of Information

Examples of data visualizations created with Tableau Public are in the Tableau Gallery.

Tableau Public has a series of 22 short videos to teach people how to use it, available here: https://public.tableau.com/en-us/s/resources. Tableau Public help is available here: https://www.tableau.com/support/public.

An ebook on Tableau Public is available through the A.O.K. library:

OpenRefine is an open source tool that allows users to load data, clean it quickly and accurately, transform it, and even geocode it.

The main use of OpenRefine is data processing and transformation to other formats. What's more, all actions performed on a dataset are stored in a project and can be replayed on another dataset!

Why Use OpenRefine?

  • Simple installation
  • Extensive documentation
  • Lots of great import formats: TSV, CSV, XML, RDF Triples, JSON, Google Sheets, Excel
  • Upload from local drive or import from URL
  • Many export formats: TSV, CSV, Excel, HTML table
  • Works with large-ish datasets (100,000 rows). Can adjust memory allocation to accommodate larger datasets.
  • Data remains on your computer, so nothing is shared until you choose to share it.
  • Useful extensions: geoXtension, Opentree for phylogenetic trees from Open Tree of Life, and many more (listed here, scroll to ‘extensions’)!
  • Active development community

Key Features

Facets

One of the most powerful operations OpenRefine offers is facets. The facet view for a given column shows all of its unique entries with their frequencies, which gives you a feel for how consistent your data is. You can also use facets to subset the rows you want to change in bulk. Facet information always appears in the left-hand panel of the OpenRefine interface. (A rough pandas analogue of these facet operations appears after the custom facets list below.) There are:

  • Numeric facets
  • Timeline facets (for dates)
  • Custom facets
  • Scatterplot facets

Some of the default custom facets are:

  • Word facet - this breaks down text into words and counts the number of records each word appears in
  • Duplicates facet - this results in a binary facet of ‘true’ or ‘false’. Rows appear in the ‘true’ facet if the value in the selected column is an exact match for a value in the same column in another row
  • Text length facet - creates a numeric facet based on the length (number of characters) of the text in each row for the selected column. This can be useful for spotting incorrect or unusual data in a field where specific lengths are expected (e.g. if the values are expected to be years, any row with a text length more than 4 for that column is likely to be incorrect)
  • Facet by blank - a binary facet of ‘true’ or ‘false’. Rows appear in the ‘true’ facet if they have no data present in that column. This is useful when looking for rows missing key data.
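
If you are used to working in code, the same faceting ideas map onto familiar operations. Here is a minimal pandas sketch in Python (the column name and data are hypothetical, and this is only an analogue of OpenRefine's facets, not OpenRefine itself):

import pandas as pd

# Hypothetical column of year values with a few problems mixed in.
df = pd.DataFrame({"year": ["1999", "2001", "2001", None, "20011"]})

# Text facet analogue: unique entries with their frequencies.
print(df["year"].value_counts(dropna=False))

# Duplicates facet analogue: True where the value also appears in another row.
print(df["year"].duplicated(keep=False))

# Text length facet analogue: flag values whose length is not 4 characters
# (missing values also show up in this filter).
print(df[df["year"].str.len().ne(4)])

# Facet-by-blank analogue: rows with no data in the column.
print(df[df["year"].isna()])
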
Google Refine Expression Language

GREL stands for the Google Refine Expression Language, and it's a way to automate changes in OpenRefine. You can use GREL to query APIs, change data formats, split columns, and much more. OpenRefine lets you choose between GREL, Python/Jython (an implementation of Python designed to run on the Java platform), and Clojure (a dialect of the Lisp programming language). You can use regular expressions in GREL to powerfully repurpose and redefine your data; a regular expression (regex) is a sequence of characters that defines a search pattern. You can even use GREL to call the Google Maps API to get latitude/longitude coordinates for datasets that contain addresses. The possibilities with GREL are endless!
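
As a rough illustration of the kind of bulk, expression-driven cleanup GREL makes possible, here is a Python/pandas sketch using a regular expression (GREL has its own syntax; this is only an analogue, and the column name and data are hypothetical):

import re
import pandas as pd

# Hypothetical messy "name" column.
df = pd.DataFrame({"name": ["  Smith, John ", "DOE, JANE", "Brown,Ann"]})

# The kind of cell-by-cell transformation a GREL expression would apply in bulk:
# trim whitespace, normalize case, and reorder "Last, First" into "First Last".
def clean_name(value: str) -> str:
    value = value.strip().title()
    match = re.match(r"^(?P<last>[^,]+),\s*(?P<first>.+)$", value)
    return f"{match.group('first')} {match.group('last')}" if match else value

df["name"] = df["name"].map(clean_name)
print(df)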

Operating Systems Needed and Installation

OpenRefine runs on Windows, macOS, and Linux. Installation instructions and downloads of the program are available here: https://openrefine.org/download.html.

Documentation, Classes, and other Sources of Information

Both an FAQ and a user manual are available.

A number of online courses are available:

An ebook on OpenRefine is available through the Library:

Using OpenRefine: The Essential OpenRefine Guide That Takes You from Data Analysis and Error Fixing to Linking Your Dataset to the Web

R is both an extensible language and an environment for statistical computing and data visualization. It can be used with Jupyter Notebooks, described below.

R includes:

  • data handling and storage facilities
  • operators for calculations on arrays and matrices
  • a large collection of tools for data analysis
  • graphics for data analysis and display
  • a programming language that includes conditionals, loops, user-defined recursive functions, and input and output facilities
  • functions to define new capabilities
  • the ability to link to and call C, C++, and Fortran code
  • many extension packages

Operating Systems Needed and Installation

R compiles and runs on Windows, macOS, Linux, FreeBSD, and a variety of UNIX platforms.

To install, first find a CRAN mirror on this page, https://cran.r-project.org/mirrors.html, then find the appropriate download. A variety of help documentation is available on the CRAN sites, including a manual that covers installation. Manuals with installation instructions are also available here: https://cran.r-project.org/manuals.html (the most current versions of the manuals are listed under R-release) and on other CRAN sites.

Documentation, Classes, and other Sources of Information

The R Project website, https://www.r-project.org/, includes manuals and an FAQ.

Codecademy and Coursera have free R tutorials and courses. Many R courses are also available for a fee; Code Spaces has ranked courses and provided a Top Ten R Courses list for 2021. You can also learn about R from books; many are available for free via AOK OneSearch. Here are some selected ebooks on R:

There are more specialized ebooks on using R with specific types of data and specific disciplines, so you might want to search the library catalog for a book that more specifically addresses what you want to do.

Python is a programming language that can be used to analyze, explore, and visualize data. Python can be used with Jupyter Notebooks, described below. A minimal sketch of analyzing a small dataset with plain Python follows the list below.

  • For those with programming experience, Python can be easier to use than R
  • Python is a general-purpose programming language, while R is focused on statistics and data analysis
  • Supports machine learning
  • Allows data and data analysis tasks to be integrated with the web
  • Those with a statistics background may be better off using R
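
To give a feel for what "using Python for data" can look like with nothing but the standard library, here is a minimal sketch (the file name and column names are hypothetical):

import csv
from collections import Counter
from statistics import mean

# Minimal sketch: summarize a small CSV using only the standard library.
# "survey.csv" and its columns ("department", "score") are hypothetical.
with open("survey.csv", newline="") as f:
    rows = list(csv.DictReader(f))

print(len(rows), "responses")
print("Responses per department:", Counter(row["department"] for row in rows))
print("Average score:", mean(float(row["score"]) for row in rows))
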

Operating System Needed and Installation

Python runs on Windows, macOS, and Linux. Python comes already installed on many computers. Instructions for seeing whether it's already installed on your computer, and for installing it if it's not, are available here: https://wiki.python.org/moin/BeginnersGuide/Download.
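
If a Python interpreter is already installed, running python3 --version (or python --version) at a command prompt prints its version; from inside Python itself you can check the same information:

# Quick check of which Python interpreter you are running and where it lives.
import sys

print(sys.version)     # version string, e.g. "3.12.x ..."
print(sys.executable)  # path to the interpreter program
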

Documentation, Classes, and Sources of Information

Because Python is a full programming language, we'll focus specifically on using Python for data--otherwise you'll spend time learning how to do many things that you don't need to do! If you're new to Python and want to use it specifically for data, start with Python courses designed for data work, such as DataCamp's or DataQuest's. These aren't free, but they are self-paced, require writing real code, and use real data.

Python has a user guide, but we don't recommend the beginner's guide for non-programmers if you're only interested in Python for data--instead, learn from something that focuses exclusively on using Python with data.

Pandas is the Python data analysis library. Installation instructions and beginner's tutorials are available here: https://pandas.pydata.org/getting_started.html. Its user guide is here: https://pandas.pydata.org/docs/user_guide/index.html. (A minimal pandas sketch appears at the end of this Python section.) There are a lot of tutorials and courses available for data in Python. Here are a few free ones:

The library also has many ebooks on using Python for data:

There are more specialized ebooks on using Python with specific types of data and specific disciplines, so you might want to search the library catalog for a book that more specifically addresses what you want to do.
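
To give a feel for the pandas workflow mentioned above, here is a minimal sketch (the file name and column names are hypothetical):

import pandas as pd

# Minimal pandas sketch: load, inspect, and summarize a dataset.
# "measurements.csv" and its columns ("site", "value") are hypothetical.
df = pd.read_csv("measurements.csv")

print(df.head())                            # first few rows
print(df.describe())                        # summary statistics for numeric columns
print(df.groupby("site")["value"].mean())   # mean value per site

df.to_csv("measurements_clean.csv", index=False)   # write results back out
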

The Jupyter Notebook is an interactive computing environment that enables users to author notebook documents that include code, interactive widgets, plots, narrative text, equations, images and even video!

It includes:

  • In-browser editing for code, with automatic syntax highlighting, indentation, and tab completion/introspection.

  • The ability to execute code from the browser, with the results of computations attached to the code that generated them.

  • Displaying the results of computations using rich media representations, such as HTML, LaTeX, PNG, SVG, etc. (see the short sketch after this list).

  • In-browser editing for rich text using the Markdown markup language, which can provide commentary for the code, and is not limited to plain text.

  • The ability to easily include mathematical notation within Markdown cells using LaTeX, rendered natively by MathJax.
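
As a small illustration of the rich media output mentioned in the list above, code run in a notebook cell can display objects that render as formatted HTML or typeset math rather than plain text (a minimal sketch, intended to be run inside a notebook cell):

from IPython.display import HTML, Math, display

# Inside a notebook cell, these render as formatted HTML and typeset math
# rather than plain text.
display(HTML("<b>A bold heading rendered from HTML</b>"))
display(Math(r"\int_0^1 x^2 \, dx = \frac{1}{3}"))
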

The Jupyter Notebook combines three components:

  • The notebook web application: An interactive web application for writing and running code interactively and authoring notebook documents.

  • Kernels: Separate processes started by the notebook web application that run users' code in a given language (e.g. Python, R, Julia, Go, and more--the full list of kernels is on the Jupyter wiki) and return output back to the notebook web application. The kernel also handles things like computations for interactive widgets, tab completion, and introspection.

  • Notebook documents: Self-contained documents that contain a representation of all content visible in the notebook web application, including inputs and outputs of the computations, narrative text, equations, images, and rich media representations of objects. Each notebook document has its own kernel. You can export your notebook to many other formats, even LaTeX and PDF! (A minimal sketch of creating a notebook document programmatically follows this list.)
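
Because notebook documents are plain JSON files (.ipynb), they can also be created and inspected programmatically. Here is a minimal sketch using the nbformat library, which is installed along with Jupyter (the file name is arbitrary):

import nbformat
from nbformat.v4 import new_code_cell, new_markdown_cell, new_notebook

# Build a small notebook document with one Markdown cell (including LaTeX math)
# and one code cell, then write it to disk as a .ipynb file.
nb = new_notebook()
nb.cells = [
    new_markdown_cell("# Notes\nSome commentary with math: $e^{i\\pi} + 1 = 0$."),
    new_code_cell("print('hello from a code cell')"),
]
nbformat.write(nb, "example.ipynb")

# Notebooks are plain JSON, so they can be read back and inspected the same way.
print(nbformat.read("example.ipynb", as_version=4).cells[0].source)
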

 

Operating System Needed and Installing Jupyter Notebooks

Jupyter Notebook runs on Windows, macOS, and Linux.

You can install Jupyter notebooks and some key kernels on your computer in a few ways:

Our recommended method is to download using Anaconda (make sure you select version 3.*), which gives you Jupyter, Python 3, and a lot of key Python libraries for research: https://www.anaconda.com/download/. After you've finished downloading and installing Anaconda, you should see an application called "Jupyter Notebook" in your list of applications.

If you're comfortable with the terminal, you can also install Jupyter Notebook with pip:

python3 -m pip install --upgrade pip
python3 -m pip install jupyter
jupyter notebook  # launches the notebook interface

Additional installation possibilities are available here: https://jupyter.readthedocs.io/en/latest/install.html

 

Documentation, Classes, and other Sources of Information

Jupyter Notebooks documentation is available here: https://jupyter-notebook.readthedocs.io/en/stable/index.html.

There are many free Jupyter Notebook tutorials available to get you started. Here are a few:

There are also many Jupyter Notebooks courses available for a fee. Douglas Hollis provides a list of top 10 Jupyter Notebooks Training Courses.

You can also learn about Jupyter Notebooks from books--a few are available for free via AOK OneSearch. Here are some selected ebooks on Jupyter:

There are more specialized ebooks on using Jupyter Notebooks with specific types of data and specific disciplines, so you might want to search the library catalog for a book that more specifically addresses what you want to do.