Research Data: Data Collection

Software

UMBC has a campus-wide subscription to Lab Archives for data collection and research documentation. It allows all research team members to work together and communicate, with granular access control, change history, disaster-recovery backups, and more.

Documentation

Start early! Careful planning of your documentation at the beginning of your project helps you save time and effort. Do not leave the documentation for the very end of your project. Remember to include procedures for documentation in your data management planning.
Think about the information that is needed in order to understand the data. What will other researchers and re-users need in order to understand your data?
Make a note of all file names and formats associated with the project, how the data is organized, how the data was generated (including any equipment or software used), and information about how the data has been altered or processed.
Create a separate documentation file that records the basic information about the data. Include where you got the data so that you and others can find it. Include how the data was generated, the equipment or software used, and the experimental protocol. Also include an explanation of codes, abbreviations, or variables used in the data or in the file naming structure. You can also create similar files for each data set. Remember to organize your files so that there is a clear connection between the documentation file and the data sets.
Plan where to deposit the data after the completion of the project. The repository probably follows a specific metadata standard that you can adopt.
Document consistently throughout the project. Data documentation gives contextual information about your dataset(s). It specifies the aims and objectives of the original project and contains explanatory material including the data source, data collection methodology and process, dataset structure, and technical information. Rich and structured information helps others identify a dataset and make choices about its content and usability.

Good quality documentation allows others to understand and use your data. Documentation can include:

Interview protocols
Questionnaires & interviewer instructions
Codebook or data dictionary
Information sheets, Consent forms, Ethical approval
Database schemas
Methodology reports
Provenance information about sources of derived or digitized data

The project-level documentation explains the aims of the study, the research questions or hypotheses, the methodologies used, the instruments and measures used, and so on. Your project-level documentation should answer the following questions:

For what purpose was the data created?
What does the dataset contain?
How was data collected?
Who collected the data and when?
How was the data processed?
What possible manipulations were done to the data?
What were the quality assurance procedures?
How can the data be accessed?

Collecting this information in one document will help when new members join a research team, when you write up a paper, or if you plan to share your data at the end of the project.
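One way to keep these answers in a single document is to generate a plain-text template from the question list. The following is a minimal sketch, not a required format: the function name `render_project_readme` and the `TODO` placeholder are assumptions made for illustration.

```python
# Questions taken from the project-level list above.
PROJECT_QUESTIONS = [
    "For what purpose was the data created?",
    "What does the dataset contain?",
    "How was data collected?",
    "Who collected the data and when?",
    "How was the data processed?",
    "What possible manipulations were done to the data?",
    "What were the quality assurance procedures?",
    "How can the data be accessed?",
]


def render_project_readme(title: str, answers: dict[str, str]) -> str:
    """Return plain-text documentation with one entry per question.

    Unanswered questions are kept and marked TODO (a hypothetical
    placeholder) so that gaps stay visible to the team.
    """
    lines = [title, "=" * len(title), ""]
    for question in PROJECT_QUESTIONS:
        lines.append(question)
        lines.append(answers.get(question, "TODO"))
        lines.append("")
    return "\n".join(lines)
```

Calling `render_project_readme("My project", {})` gives an empty template that can be saved in the project's documentation folder and filled in as the project progresses.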

Data-level or object-level documentation provides information at the level of individual objects, such as pictures, interview transcripts, or variables in a database. You can embed data-level information in the data files themselves. For example, for interviews, it is best to write down the contextual and descriptive information about each interview at the beginning of each file. For quantitative data, variable and value names can be embedded within the data file itself.

For quantitative data, document the following information as needed:

Information about the data file
- Data type, file type, and format, size, data processing scripts
Information about the variables in the file
- The names, labels and descriptions of variables, their values, descriptions of derived variables and, where applicable, frequencies, basic contingencies, etc. The exact original wording of the question should also be available.
- Variable labels should:
  - Be brief with a maximum of 80 characters
  - Indicate the unit of measurement, where applicable
  - Reference the question number of a survey or questionnaire, where applicable
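The label rules above can be checked automatically before a codebook is shared. This is a sketch under stated assumptions: the 80-character limit comes from the guidance, but the conventions that a unit appears in trailing parentheses and that a question number appears as a `Q12:` prefix are hypothetical choices made for this example.

```python
import re

MAX_LABEL_LENGTH = 80  # limit stated in the guidance above


def check_variable_label(
    label: str, needs_unit: bool = False, needs_question: bool = False
) -> list[str]:
    """Return a list of problems found in a variable label (empty if none)."""
    problems = []
    if not label.strip():
        problems.append("label is empty")
    if len(label) > MAX_LABEL_LENGTH:
        problems.append(
            f"label has {len(label)} characters (maximum {MAX_LABEL_LENGTH})"
        )
    # Assumed convention: the unit sits in parentheses at the end, e.g. "Height (cm)".
    if needs_unit and not re.search(r"\([^()]+\)\s*$", label):
        problems.append("no unit of measurement in trailing parentheses")
    # Assumed convention: the question number is a prefix such as "Q12:".
    if needs_question and not re.match(r"Q\d+[a-z]?:", label):
        problems.append("no question number prefix such as 'Q12:'")
    return problems
```

For example, `check_variable_label("Q3: Height of respondent (cm)", needs_unit=True, needs_question=True)` returns an empty list, while a label without a unit returns one problem message.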

For qualitative data document the following information as needed:

For textual data, for example interviews, include key information about participants such as age, gender, occupation, location, and relevant contextual information.
For qualitative data collections (for example image or interview collections) you may wish to provide a data list that provides information that enables the identifying and locating of relevant items within a data collection:
- The list contains key biographical characteristics and thematic features of participants such as age, gender, occupation or location, and identifying details of the data items;
- For image collections, the list holds key features for each item;
- The list is created from an initial list of interviews, field notes or other materials provided by the data depositor.
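A data list of this kind can be kept as a simple CSV file. The sketch below assumes a set of column names (`item_id`, `file_name`, and the participant characteristics mentioned above); these names are illustrative, not a prescribed standard, and should be adapted to the collection.

```python
import csv
import io

# Hypothetical column set, based on the characteristics named above.
DATA_LIST_COLUMNS = ["item_id", "file_name", "age", "gender", "occupation", "location"]


def data_list_to_csv(items: list[dict[str, str]]) -> str:
    """Render the data list as CSV text, leaving unknown fields empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=DATA_LIST_COLUMNS, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(items)  # raises ValueError if an item has an unknown column
    return buffer.getvalue()
```

Each row describes one item in the collection, so a reader can locate the relevant file without opening every item.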

Organizing Project Folders

Put each project in its own directory, named after the project and optionally prefixed with the date the project started (YYYY-MM-DD).
Don't try to document the date, time, quality or other characteristics in the file structure; use a README file for this.
Put text documents and relevant supplementary documentation associated with the project in the docs folder.
Put raw data and metadata in the data folder, which should be read-only. Do not change your raw data directly!
Put files generated during cleanup and analysis (like processed data or visualizations) in a results folder.
Put source for the project’s scripts and programs in the src folder.
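The folder layout described above can be created with a short script. This is a minimal sketch: the function names and the `README.txt` file name are assumptions, and the read-only step only clears write permission bits, which behaves differently across operating systems.

```python
import stat
from datetime import date
from pathlib import Path

# Folder names taken from the layout described above.
SUBFOLDERS = ("docs", "data", "results", "src")


def create_project_skeleton(parent: Path, project_name: str, started: date) -> Path:
    """Create <YYYY-MM-DD>_<project_name>/ with docs, data, results and src."""
    root = parent / f"{started.isoformat()}_{project_name}"
    for name in SUBFOLDERS:
        (root / name).mkdir(parents=True, exist_ok=True)
    readme = root / "README.txt"  # hypothetical file name
    if not readme.exists():
        readme.write_text(
            f"Project: {project_name}\nStarted: {started.isoformat()}\n",
            encoding="utf-8",
        )
    return root


def make_read_only(path: Path) -> None:
    """Clear the write permission bits on a raw data file."""
    mode = path.stat().st_mode
    path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
```

For example, `create_project_skeleton(Path.home(), "soil_survey", date(2019, 3, 11))` would create a directory named `2019-03-11_soil_survey` (a made-up project name) containing the four subfolders.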

File Format Selection, Naming, and Version Control

Many disciplines have file naming recommendations, for example: DOE’s Atmospheric Radiation Measurement (ARM) program. Check for these.

Ideally, file types for a project should be standard, non-proprietary, and open. If these features are not possible, at the very least file format selection should be made with sustainability and long-term use in mind. Try opening a Windows 95 Word document on your modern computer, and you'll understand why (hint: you will get only Wingdings!).

Software often relies on proprietary file formats that do not last long as new versions are created or tools lose relevance. Where possible, export data files to stable formats for long-term access to your data, or convert proprietary files into equivalent standardized files that will be able to represent that data (like going from .xlsx to .csv).

A proprietary format can refer to:

a file format that contains data that is ordered and stored according to a particular encoding-scheme, designed by the authors to be secret
- The secrecy means that specific hardware and software (designed and sold by the authors) can interpret the format better than others (for example, opening a .psd file is more seamless in Photoshop than in Glimpse)
a file format that is openly documented but whose use is restricted through licenses

An open format is:

"An open format is one that is platform independent, machine readable, and made available to the public without restrictions that would impede the re-use of that information." -- Open Government Directive
a file format defined by a published specification maintained by a standards organization, and which has no restrictions on its usage (e.g. not restricted by copyright)
- There are no restrictions on the type of software or hardware that can use these by design (like how a .csv can be used by Google Sheets, Excel, and LibreOffice Calc)

Examples of sustainable formats

Long-term formats for data

Text

XML (.xml)
HTML (.htm)
OpenDocument Format (e.g. OpenDocument Text, .odt)
Plain text (.txt)
Markdown and other human-readable markup languages deploying plain-text editing

Tabular

Character-delimited files such as Comma Separated Value (.csv) or Tab Delimited (.tab)
XML
Plain text (.txt)
JSON

Media

Uncompressed TIFF (.tif)
JPEG 2000 (.jp2, or .mj2 for Motion JPEG 2000)
MPEG-4 (.mp4)
Free Lossless Audio Codec (.flac)

Geospatial

ESRI Shapefiles and supporting files (.shp, .shx, .dbf, .prj, .sbx, .sbn)
KML (.kml)
GML (.gml)
GeoTIFF (.tif, .tfw)

Mid-term formats for data

Text

PDF/A

Statistical

SPSS portable format (.por)
R file formats, i.e. script files (.R), data files (.Rda, .Rdata) or markdown files (.Rmd)
Stata file formats, i.e. do-files (.do) and data files (.dta)
SAS file formats (.sas, .xpt, etc.)

Media

JPEG (.jpeg, .jpg)
MP3 (.mp3)
Photoshop files (.psd)

Geospatial

ESRI geodatabase formats (.gdb, .mdb)

Encoding

Where possible given the limits of file formatting, encoding should be done using the Unicode system (UTF-8 or UTF-16), or using the older ASCII system that has been incorporated into Unicode.
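Converting an older text file to UTF-8 can be sketched as follows. The function names are illustrative, and the source encoding must be known in advance: the code does not detect it, and decoding with the wrong encoding can silently produce garbled text.

```python
from pathlib import Path


def convert_to_utf8(source: Path, target: Path, source_encoding: str) -> None:
    """Re-encode a text file as UTF-8, leaving line endings untouched.

    source_encoding must be supplied by the caller; it is not detected here.
    """
    text = source.read_bytes().decode(source_encoding)
    target.write_bytes(text.encode("utf-8"))


def is_valid_utf8(path: Path) -> bool:
    """Report whether the file's bytes decode cleanly as UTF-8."""
    try:
        path.read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Writing to a separate target file keeps the original untouched, in line with the advice not to change raw data directly.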

Research data files and folders should be labelled and organized in a systematic and consistent way so that they are easy to find, both for you and others in your research team. There is no one recommended way to name your files and folders, but consistency is key.

It’s generally useful to aim for file and folder names which are concise, but informative – it makes life easier if you can tell what’s in a file without having to open it.

Avoid special characters like &, %, $, #, @, and *. Just use letters and numbers.
Do not make file identity dependent on capitalization unless implementing camel case (e.g. fileName.xml).
Never use spaces in filenames – many systems and software will not recognize them or will give errors unless such filenames are treated specially. Use an underscore _ instead of a space.
Use short file names, both for your own sake and for the sake of systems that fail on long names (for example, names of 50 characters or more).
Use 001, 002, 003 instead of 1, 2, 3 to help sort and search through the data more effectively.
Choose file names that are recognizable to humans and that make sense within the project environment.
Be consistent.
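Several of the naming rules above can be applied mechanically. The following sketch is one possible interpretation: it keeps hyphens so that YYYY-MM-DD dates survive, lowercases the extension, and zero-pads sequence numbers to three digits. Those three choices, and the function names, are assumptions rather than part of the guidance.

```python
import re


def clean_file_name(name: str) -> str:
    """Replace spaces with underscores and drop special characters."""
    stem, dot, extension = name.rpartition(".")
    if not dot:  # the name has no extension
        stem, extension = name, ""
    stem = re.sub(r"\s+", "_", stem.strip())
    # Keep letters, digits, underscores and (by assumption) hyphens for dates.
    stem = re.sub(r"[^A-Za-z0-9_-]", "", stem)
    stem = re.sub(r"_+", "_", stem)  # collapse runs left by removed characters
    return f"{stem}.{extension.lower()}" if extension else stem


def numbered_name(prefix: str, index: int, extension: str, width: int = 3) -> str:
    """Build names such as sample_007.csv so that files sort in numeric order."""
    return f"{prefix}_{index:0{width}d}.{extension}"
```

For instance, `clean_file_name("Interview 2 & notes (draft).TXT")` returns `Interview_2_notes_draft.txt`, and `numbered_name("sample", 7, "csv")` returns `sample_007.csv`.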

Elements of a file name can include:

A project acronym
Content description
File type information
Date (YYYY-MM-DD)
Creator name or initials
Version number
Status info, e.g. draft

Operating systems usually default to sorting files alphabetically, so it can be helpful to think about what comes at the start of the file name – is it more useful to order the files by date, by author, or by subject, for example?

The benefit of consistent naming of data files is that it is easier to identify all files connected to one data collection event (e.g. one interview). The files related to one collection event (e.g. audio tape, its transcription and photographs that were taken by the interviewee) can be connected by the file name.

Example:

20190311_interview2_audio.wav
20190311_interview2_trans.txt
20190311_interview2_image.jpg
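With a consistent pattern like the one in this example, files from the same collection event can be grouped automatically. The sketch assumes that the first two underscore-separated parts of the name (date and event) identify the event; that rule is specific to this example pattern, and the function name is illustrative.

```python
from collections import defaultdict


def group_by_event(file_names: list[str]) -> dict[str, list[str]]:
    """Group file names by their shared '<date>_<event>' prefix."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name in file_names:
        stem = name.rsplit(".", 1)[0]  # drop the extension
        parts = stem.split("_")
        event = "_".join(parts[:2])  # assumed: date + event identify the event
        groups[event].append(name)
    return {event: sorted(names) for event, names in groups.items()}
```

Applied to the three files above, the function returns a single group keyed by `20190311_interview2` that contains the audio, image, and transcript files.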

Version control is "the management of changes to documents, computer programs, large web sites, and other collections of information." (Wikipedia). It's a way that we can keep track of our projects across time, space, different users, and different systems. Nothing that is version controlled is ever lost, and there is a record of changes, who made them, and when. Systems with version control can notify users when there is conflict between one person's work and another's.

Managing different versions of your data can be achieved by:

Uniquely identifying different versions of files using a systematic naming convention, such as using version numbers or dates
- Record the date within the file name: 20190902_documentation_for_my_data
- Include a version number in the file name: Documentation_v2
- Include information about the status of the file, e.g. "draft" or "final," as long as you don't end up with confusing names like "final2" or "final_revised".
- Include information about what changes were made, e.g. "cropped" or "normalized".
Using version control facilities within the software you use
Using file-sharing services with incorporated version control
Designing and using a version control table

File name	Changes to file
Interviewschedule_1.0	Original document
Interviewschedule_1.1	Minor revisions made
Interviewschedule_1.2	Further minor revisions
Interviewschedule_2.0	Substantive changes
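The numbering used in the table can be handled programmatically. This sketch assumes names of the form `<name>_<major>.<minor>` without a file extension, as in the table, and treats a "substantive" change as a major version bump; the function names are illustrative.

```python
import re

# Matches names like "Interviewschedule_1.2" (no extension, as in the table).
VERSION_PATTERN = re.compile(r"^(?P<base>.+)_(?P<major>\d+)\.(?P<minor>\d+)$")


def parse_version(name: str) -> tuple[str, int, int]:
    """Split a versioned name into (base, major, minor)."""
    match = VERSION_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a versioned name: {name!r}")
    return match["base"], int(match["major"]), int(match["minor"])


def next_version(name: str, substantive: bool) -> str:
    """Bump the major number for substantive changes, otherwise the minor."""
    base, major, minor = parse_version(name)
    if substantive:
        return f"{base}_{major + 1}.0"
    return f"{base}_{major}.{minor + 1}"


def latest(names: list[str]) -> str:
    """Pick the highest version, comparing the numbers numerically."""
    return max(names, key=lambda n: parse_version(n)[1:])
```

Comparing numerically matters because plain alphabetical sorting would place version 1.10 before version 1.2.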
