Many disciplines publish their own file naming recommendations; one example is the DOE’s Atmospheric Radiation Measurement (ARM) program. Check whether your field has such a convention.
Ideally, file types for a project should be standard, non-proprietary, and open. If that is not possible, at the very least choose file formats with sustainability and long-term use in mind. Try opening a Windows 95 Word document on a modern computer and you'll understand why (hint: you will get only Wingdings!).
Software often relies on proprietary file formats that become hard to open as new versions are released or the tools fall out of use. Where possible, export data files to stable formats for long-term access to your data, or convert proprietary files into equivalent standardized formats that can represent the same data (for example, going from .xlsx to .csv).
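As a minimal sketch of the export step, the Python snippet below writes tabular data to a UTF-8 encoded .csv file using only the standard library. The function name `export_rows_to_csv` and the sample rows are hypothetical. Reading the rows out of an .xlsx workbook in the first place would need a third-party library (for example openpyxl or pandas), which is not shown here.

```python
import csv
from pathlib import Path


def export_rows_to_csv(header, rows, out_path):
    """Write a header and data rows to a UTF-8 encoded CSV file.

    `header` is a list of column names and `rows` is an iterable of
    equal-length sequences. Both are illustrative inputs.
    """
    out_path = Path(out_path)
    # newline="" lets the csv module control line endings itself,
    # as the csv module documentation recommends.
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return out_path


if __name__ == "__main__":
    # Hypothetical sample data standing in for a spreadsheet export.
    sample_header = ["site", "date", "temperature_c"]
    sample_rows = [["A1", "2020-01-01", 4.2], ["A2", "2020-01-01", 3.9]]
    print(export_rows_to_csv(sample_header, sample_rows, "observations.csv"))
```

Keep the original proprietary file alongside the export if it holds formulas, formatting, or multiple sheets, since a .csv file stores only one table of plain values.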
A proprietary format can refer to:
- a file format whose data is ordered and stored according to a particular encoding scheme that its authors designed to be secret
  - the secrecy means that specific hardware and software (designed and sold by the authors) can interpret the format better than others (for example, opening a .psd file is more seamless in Photoshop than in Glimpse)
- a file format that is openly documented but whose use is restricted through licenses
An open format is:
- "An open format is one that is platform independent, machine readable, and made available to the public without restrictions that would impede the re-use of that information." -- Open Government Directive
- a file format defined by a published specification maintained by a standards organization, and which has no restrictions on its usage (e.g. not restricted by copyright)
  - by design, there are no restrictions on the type of software or hardware that can use it (for example, a .csv file can be opened in Google Sheets, Excel, and LibreOffice Calc)
Examples of sustainable formats
Long-term formats for data
Text
- XML (.xml)
- HTML (.htm)
- OpenDocument Format (e.g. OpenDocument Text, .odt)
- Plain text (.txt)
- Markdown and other human-readable markup languages deploying plain-text editing
Tabular
Media
Geospatial
- ESRI Shapefiles and supporting files (.shp, .shx, .dbf, .prj, .sbx, .sbn)
- KML (.kml)
- GML (.gml)
- GeoTIFF (.tif, .tfw)
Mid-term formats for data
Text
Statistical
- SPSS portable format (.por)
- R file formats, i.e. script files (.R), data files (.Rda, .RData), or R Markdown files (.Rmd)
- Stata file formats, i.e. do-files (.do) and data files (.dta)
- SAS file formats (.sas, .xpt, etc.)
Media
- JPEG (.jpeg, .jpg)
- MP3 (.mp3)
- Photoshop files (.psd)
Geospatial
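The format lists above can be turned into a quick audit of a project folder. The sketch below is one possible approach, not an established tool: the names `LONG_TERM`, `MID_TERM`, `classify_extension`, and `audit_formats` are hypothetical, and the extension sets contain only the extensions listed above, so anything else is reported as "unlisted" rather than judged.

```python
from collections import defaultdict
from pathlib import Path

# Extension groups copied from the lists above; extend them for your field.
LONG_TERM = {".xml", ".htm", ".odt", ".txt", ".shp", ".shx", ".dbf",
             ".prj", ".sbx", ".sbn", ".kml", ".gml", ".tif", ".tfw"}
MID_TERM = {".por", ".r", ".rda", ".rdata", ".rmd", ".do", ".dta",
            ".sas", ".xpt", ".jpeg", ".jpg", ".mp3", ".psd"}


def classify_extension(suffix):
    """Map a file suffix such as '.csv' to a category name."""
    suffix = suffix.lower()  # treat .JPG and .jpg alike
    if suffix in LONG_TERM:
        return "long-term"
    if suffix in MID_TERM:
        return "mid-term"
    return "unlisted"


def audit_formats(project_dir):
    """Group every file under project_dir by format category."""
    report = defaultdict(list)
    for path in sorted(Path(project_dir).rglob("*")):
        if path.is_file():
            report[classify_extension(path.suffix)].append(path.name)
    return dict(report)


if __name__ == "__main__":
    # Audit the current directory and print file names per category.
    for category, names in audit_formats(".").items():
        print(category, names)
```

Files that land in the "mid-term" or "unlisted" groups are candidates for export to one of the long-term formats.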
Encoding
Where the file format allows it, encode text using Unicode (UTF-8 or UTF-16), or using the older ASCII character set, which has been incorporated into Unicode.
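As an illustration, the following Python sketch re-encodes a text file from a legacy encoding to UTF-8. The function name `reencode_to_utf8` is hypothetical, and the default source encoding of `cp1252` is only an assumption: you need to know how the original file was encoded, because it cannot be detected reliably from the bytes alone.

```python
from pathlib import Path


def reencode_to_utf8(src_path, dst_path, source_encoding="cp1252"):
    """Read text in a legacy encoding and write it back out as UTF-8.

    source_encoding is an assumption about the input file; decoding
    with the wrong encoding may raise an error or produce garbled text.
    Line endings may be normalised to the platform default on writing.
    """
    text = Path(src_path).read_text(encoding=source_encoding)
    Path(dst_path).write_text(text, encoding="utf-8")
    return dst_path


# ASCII text has the same bytes in UTF-8, so pure-ASCII files
# are already valid UTF-8 and need no conversion.
assert "plain text".encode("ascii") == "plain text".encode("utf-8")
```

Write the converted text to a new file rather than overwriting the source, so the original bytes remain available if the assumed encoding turns out to be wrong.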