
Research Data

Introduction to Data Integration, Analysis, and Visualization

Data integration, analysis, and visualization are carried out with mathematical and statistical software and programming tools. UMBC purchases a number of commercial tools for students, faculty, and staff to use, and there are also a number of powerful open source (free) tools. For the UMBC-purchased tools, we've provided a link to DoIT's information on the tool, including a description of its capabilities, installation instructions, and in some instances, information on how to get started with the tool. For the free tools, we've provided more in-depth information along with links to documentation and courses on using the tool.

A great UMBC resource is The Center for Interdisciplinary Research and Consulting (CIRC), which is a consulting service for mathematics and statistics provided by the Department of Mathematics and Statistics. It offers workshops on R, MATLAB, Stata, SPSS, and other tools.

Software Available to the UMBC Community

Open Source Software

Tableau Public is a free, limited version of Tableau, a business intelligence application. With Tableau Public you can easily create visualizations, dashboards, and data stories, but everything you publish is public, so it's not suitable for any private or confidential data.

 

Operating Systems Needed and Installation

Tableau Public is available for Windows and macOS. Download it from the Tableau Public site. When the download is complete, run the Tableau Public installer and follow the prompts.

Documentation, Classes, and other Sources of Information

Examples of data visualizations created with Tableau Public are in the Tableau Gallery.

Tableau Public has a series of 22 short videos to teach people how to use it, available here: https://public.tableau.com/en-us/s/resources. Tableau Public help is available here: https://www.tableau.com/support/public.

An ebook on Tableau Public is available through the A.O.K. library:

OpenRefine is an open source tool that allows users to load data, clean it quickly and accurately, transform it, and even geocode it.

The main use of OpenRefine is data processing and transformation to other formats. What's more, all actions performed on a dataset are stored in a project and can be replayed on another dataset!

Why Use OpenRefine?

  • Simple installation
  • Extensive documentation
  • Lots of great import formats: TSV, CSV, XML, RDF Triples, JSON, Google Sheets, Excel
  • Upload from local drive or import from URL
  • Many export formats: TSV, CSV, Excel, HTML table
  • Works with large-ish datasets (100,000 rows). Can adjust memory allocation to accommodate larger datasets.
  • Data remains on your computer, so nothing is shared until you choose to share it.
  • Useful extensions: geoXtension, Opentree for phylogenetic trees from Open Tree of Life, and many more (listed here, scroll to ‘extensions’)!
  • Active development community

Key Features

Facets

One of the most powerful operations OpenRefine offers is facets. The facet view for a given column shows all of its unique entries with their frequencies, which gives you a feel for how consistent your data is. You can also use facets to subset the rows you want to change in bulk. Facet information always appears in the left-hand panel of the OpenRefine interface. (A rough pandas analogue of these facet operations appears after the custom facets list below.) There are:

  • Numeric facets
  • Timeline facets (for dates)
  • Custom facets
  • Scatterplot facets

Some of the default custom facets are:

  • Word facet - this breaks down text into words and counts the number of records each word appears in
  • Duplicates facet - this results in a binary facet of ‘true’ or ‘false’. Rows appear in the ‘true’ facet if the value in the selected column is an exact match for a value in the same column in another row
  • Text length facet - creates a numeric facet based on the length (number of characters) of the text in each row for the selected column. This can be useful for spotting incorrect or unusual data in a field where specific lengths are expected (e.g. if the values are expected to be years, any row with a text length more than 4 for that column is likely to be incorrect)
  • Facet by blank - a binary facet of ‘true’ or ‘false’. Rows appear in the ‘true’ facet if they have no data present in that column. This is useful when looking for rows missing key data.
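
If you are used to working in code, the same faceting ideas map onto familiar operations. Here is a minimal pandas sketch in Python (the column name and data are hypothetical, and this is only an analogue of OpenRefine's facets, not OpenRefine itself):

import pandas as pd

# Hypothetical column of year values with a few problems mixed in.
df = pd.DataFrame({"year": ["1999", "2001", "2001", None, "20011"]})

# Text facet analogue: unique entries with their frequencies.
print(df["year"].value_counts(dropna=False))

# Duplicates facet analogue: True where the value also appears in another row.
print(df["year"].duplicated(keep=False))

# Text length facet analogue: flag values whose length is not 4 characters
# (missing values also show up in this filter).
print(df[df["year"].str.len().ne(4)])

# Facet-by-blank analogue: rows with no data in the column.
print(df[df["year"].isna()])
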
Google Refine Expression Language

GREL stands for the Google Refine Expression Language, and it's a way to automate changes in OpenRefine. You can use GREL to query APIs, change data formats, split columns, and much more. OpenRefine lets you choose between GREL, Python/Jython (an implementation of Python designed to run on the Java platform), and Clojure (a dialect of the Lisp programming language). You can use regular expressions in GREL to powerfully repurpose and redefine your data; a regular expression (regex) is a sequence of characters that defines a search pattern. You can even use GREL to call the Google Maps API to get latitude/longitude coordinates for datasets that contain addresses. The possibilities with GREL are endless!
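
As a rough illustration of the kind of bulk, expression-driven cleanup GREL makes possible, here is a Python/pandas sketch using a regular expression (GREL has its own syntax; this is only an analogue, and the column name and data are hypothetical):

import re
import pandas as pd

# Hypothetical messy "name" column.
df = pd.DataFrame({"name": ["  Smith, John ", "DOE, JANE", "Brown,Ann"]})

# The kind of cell-by-cell transformation a GREL expression would apply in bulk:
# trim whitespace, normalize case, and reorder "Last, First" into "First Last".
def clean_name(value: str) -> str:
    value = value.strip().title()
    match = re.match(r"^(?P<last>[^,]+),\s*(?P<first>.+)$", value)
    return f"{match.group('first')} {match.group('last')}" if match else value

df["name"] = df["name"].map(clean_name)
print(df)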

Operating Systems Needed and Installation

OpenRefine runs on Windows, macOS, and Linux. Installation instructions and downloads of the program are available here: https://openrefine.org/download.html.

Documentation, Classes, and other Sources of Information

Both an FAQ and a user manual are available.

A number of online courses are available:

An ebook on OpenRefine is available through the Library:

Using OpenRefine: The Essential OpenRefine Guide That Takes You from Data Analysis and Error Fixing to Linking Your Dataset to the Web

R is both an extensible language and an environment for statistical computing and data visualization. It can be used with Jupyter Notebooks, described below.

R includes:

  • data handling and storage facilities
  • operators for calculations on arrays and matrices
  • a large collection of tools for data analysis
  • graphics for data analysis and display
  • a programming language that includes conditionals, loops, user-defined recursive functions, and input and output facilities
  • functions to define new capabilities
  • the ability to link to and call C, C++, and Fortran code
  • many extension packages

Operating Systems Needed and Installation

R compiles and runs on Windows, macOS, Linux, FreeBSD, and a variety of UNIX platforms.

To install, first find a CRAN mirror on this page, https://cran.r-project.org/mirrors.html, then find the appropriate download. A variety of help documentation is available on the CRAN sites, including a manual that covers installation. Manuals with installation instructions are also available here: https://cran.r-project.org/manuals.html (the most current versions of the manuals are listed under R-release) and on other CRAN sites.

Documentation, Classes, and other Sources of Information

The R Project website, https://www.r-project.org/, includes manuals and an FAQ.

Codecademy and Coursera have free R tutorials and courses. Many R courses are also available for a fee; Code Spaces has ranked courses and provided a Top Ten R Courses list for 2021. You can also learn about R from books; many are available for free via AOK OneSearch. Here are some selected ebooks on R:

There are more specialized ebooks on using R with specific types of data and specific disciplines, so you might want to search the library catalog for a book that more specifically addresses what you want to do.

Python is a programming language that can be used to analyze, explore, and visualize data. Python can be used with Jupyter Notebooks, described below. A minimal sketch of analyzing a small dataset with plain Python follows the list below.

  • For those with programming experience, Python can be easier to use than R
  • Python is a general-purpose programming language, while R is focused on statistics and data analysis
  • Supports machine learning
  • Allows data and data analysis tasks to be integrated with the web
  • Those with a statistics background may be better off using R
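
To give a feel for what "using Python for data" can look like with nothing but the standard library, here is a minimal sketch (the file name and column names are hypothetical):

import csv
from collections import Counter
from statistics import mean

# Minimal sketch: summarize a small CSV using only the standard library.
# "survey.csv" and its columns ("department", "score") are hypothetical.
with open("survey.csv", newline="") as f:
    rows = list(csv.DictReader(f))

print(len(rows), "responses")
print("Responses per department:", Counter(row["department"] for row in rows))
print("Average score:", mean(float(row["score"]) for row in rows))
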

Operating System Needed and Installation

Python runs on Windows, macOS, and Linux. Python comes already installed on many computers. Instructions for seeing whether it's already installed on your computer, and for installing it if it's not, are available here: https://wiki.python.org/moin/BeginnersGuide/Download.
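
If a Python interpreter is already installed, running python3 --version (or python --version) at a command prompt prints its version; from inside Python itself you can check the same information:

# Quick check of which Python interpreter you are running and where it lives.
import sys

print(sys.version)     # version string, e.g. "3.12.x ..."
print(sys.executable)  # path to the interpreter program
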

Documentation, Classes, and Sources of Information

Because Python is a full programming language, we'll focus specifically on using Python for data--otherwise you'll spend time learning how to do many things that you don't need to do! If you're new to Python and want to use it specifically for data, start with Python courses designed for data work, such as DataCamp's or DataQuest's. These aren't free, but they are self-paced, require writing real code, and use real data.

Python has a user guide, but we don't recommend the beginner's guide for non-programmers if you're only interested in Python for data--instead, learn from something that focuses exclusively on using Python with data.

Pandas is the Python data analysis library. Installation instructions and beginner's tutorials are available here: https://pandas.pydata.org/getting_started.html. Its user guide is here: https://pandas.pydata.org/docs/user_guide/index.html. (A minimal pandas sketch appears at the end of this Python section.) There are a lot of tutorials and courses available for data in Python. Here are a few free ones:

The library also has many ebooks on using Python for data:

There are more specialized ebooks on using Python with specific types of data and specific disciplines, so you might want to search the library catalog for a book that more specifically addresses what you want to do.
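
To give a feel for the pandas workflow mentioned above, here is a minimal sketch (the file name and column names are hypothetical):

import pandas as pd

# Minimal pandas sketch: load, inspect, and summarize a dataset.
# "measurements.csv" and its columns ("site", "value") are hypothetical.
df = pd.read_csv("measurements.csv")

print(df.head())                            # first few rows
print(df.describe())                        # summary statistics for numeric columns
print(df.groupby("site")["value"].mean())   # mean value per site

df.to_csv("measurements_clean.csv", index=False)   # write results back out
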

The Jupyter Notebook is an interactive computing environment that enables users to author notebook documents that include code, interactive widgets, plots, narrative text, equations, images and even video!

It includes:

  • In-browser editing for code, with automatic syntax highlighting, indentation, and tab completion/introspection.

  • The ability to execute code from the browser, with the results of computations attached to the code that generated them.

  • Displaying the results of computations using rich media representations, such as HTML, LaTeX, PNG, SVG, etc. (see the short sketch after this list).

  • In-browser editing for rich text using the Markdown markup language, which can provide commentary for the code, and is not limited to plain text.

  • The ability to easily include mathematical notation within Markdown cells using LaTeX, rendered natively by MathJax.
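
As a small illustration of the rich media output mentioned in the list above, code run in a notebook cell can display objects that render as formatted HTML or typeset math rather than plain text (a minimal sketch, intended to be run inside a notebook cell):

from IPython.display import HTML, Math, display

# Inside a notebook cell, these render as formatted HTML and typeset math
# rather than plain text.
display(HTML("<b>A bold heading rendered from HTML</b>"))
display(Math(r"\int_0^1 x^2 \, dx = \frac{1}{3}"))
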

The Jupyter Notebook combines three components:

  • The notebook web application: An interactive web application for writing and running code interactively and authoring notebook documents.

  • Kernels: Separate processes started by the notebook web application that run users' code in a given language (e.g. Python, R, Julia, Go, and more--the full list of kernels is on the Jupyter wiki) and return output back to the notebook web application. The kernel also handles things like computations for interactive widgets, tab completion, and introspection.

  • Notebook documents: Self-contained documents that contain a representation of all content visible in the notebook web application, including inputs and outputs of the computations, narrative text, equations, images, and rich media representations of objects. Each notebook document has its own kernel. You can export your notebook to many other formats, even LaTeX and PDF! (A minimal sketch of creating a notebook document programmatically follows this list.)
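
Because notebook documents are plain JSON files (.ipynb), they can also be created and inspected programmatically. Here is a minimal sketch using the nbformat library, which is installed along with Jupyter (the file name is arbitrary):

import nbformat
from nbformat.v4 import new_code_cell, new_markdown_cell, new_notebook

# Build a small notebook document with one Markdown cell (including LaTeX math)
# and one code cell, then write it to disk as a .ipynb file.
nb = new_notebook()
nb.cells = [
    new_markdown_cell("# Notes\nSome commentary with math: $e^{i\\pi} + 1 = 0$."),
    new_code_cell("print('hello from a code cell')"),
]
nbformat.write(nb, "example.ipynb")

# Notebooks are plain JSON, so they can be read back and inspected the same way.
print(nbformat.read("example.ipynb", as_version=4).cells[0].source)
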

 

Operating System Needed and Installing Jupyter Notebooks

Jupyter Notebook runs on Windows, macOS, and Linux.

You can install Jupyter notebooks and some key kernels on your computer in a few ways:

Our recommended method is to download using Anaconda (make sure you select version 3.*), which gives you Jupyter, Python 3, and a lot of key Python libraries for research: https://www.anaconda.com/download/. After you've finished downloading and installing Anaconda, you should see an application called "Jupyter Notebook" in your list of applications.

If you're comfortable with the terminal, you can also install Jupyter Notebook with pip:

python3 -m pip install --upgrade pip
python3 -m pip install jupyter
jupyter notebook  # launches the notebook interface

Additional installation possibilities are available here: https://jupyter.readthedocs.io/en/latest/install.html

 

Documentation, Classes, and other Sources of Information

Jupyter Notebooks documentation is available here: https://jupyter-notebook.readthedocs.io/en/stable/index.html.

There are many free Jupyter Notebook tutorials available to get you started. Here are a few:

There are also many Jupyter Notebooks courses available for a fee. Douglas Hollis provides a list of top 10 Jupyter Notebooks Training Courses.

You can also learn about Jupyter Notebooks from books--a few are available for free via AOK OneSearch. Here are some selected ebooks on Jupyter:

There are more specialized ebooks on using Jupyter Notebooks with specific types of data and specific disciplines, so you might want to search the library catalog for a book that more specifically addresses what you want to do.