Hands On RDM
Files: Formats, Standards, Conversions

1 Topic

2 Files

2.1 Formats and Standards

  • Files have formats.
  • The file extension will (mostly) reveal the format.
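Why only "mostly"? An extension is just part of the filename and can be wrong or missing. Many binary formats also start with a fixed byte signature, which can be checked independently of the name. The following is a minimal sketch of such a check; the function name sniff_format and the selection of signatures are illustrative choices, not part of the workshop material. Plain-text formats (.txt, .md, .csv) have no signature and are reported as unknown.

```python
# Leading byte signatures ("magic bytes") of a few common formats.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpg"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
    # .docx, .xlsx, .odt and .ods are ZIP containers, so they all match here.
    (b"PK\x03\x04", "zip"),
    # Legacy .doc and .xls files use the OLE2 compound file container.
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),
]


def sniff_format(path):
    """Guess a format from the first bytes of a file; None if unknown."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    for magic, name in SIGNATURES:
        if head.startswith(magic):
            return name
    return None
```

If sniff_format returns "pdf" for a file named report.txt, the extension is misleading and the file should be renamed before it is shared.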

2.1.1 Text

  • Microsoft Word (.doc vs. .docx)
  • Apple Pages (.pages)
  • LibreOffice Writer (.odt)
  • Markdown (.md), Plain Text (.txt), LaTeX (.tex)

    What else do you use?

2.1.2 Spreadsheet

  • Microsoft Excel (.xls vs. .xlsx)
  • Apple Numbers (.numbers)
  • LibreOffice Calc (.ods)
  • CSV (.csv), TSV (.tsv) etc.

2.1.3 Images

  • Raster vs. Vector

    Figure 2: Difference between raster and vector formats (image: vector_raster.gif)

  • Raster: (.jpg, .png)
  • Vector: (.pdf, .svg, .eps)

2.1.4 Code

File formats for code are (usually) open.

Recommendation: Jupyter Notebook (.ipynb)

  • Code and documentation (literate programming) in one file

2.1.5 Proprietary

  • Is there documentation?
  • What do you need to open the file?
  • Can you convert the data?

2.2 Conversions and Archive

  • To ensure compatibility, convert your document.
  • Not every format is suitable for archiving!

For archiving your project files, prefer:

  • PDF (.pdf)
  • TIFF (.tiff)
  • CSV (.csv)

3 HandsOn!-Session

There are several files in the folder "Toms-data-sets" (https://rwth-aachen.sciebo.de/s/MrqD1tXyXlkcuA7).

You will need to perform certain steps. Document the steps (write them down, take screenshots, etc.) so that you or your colleagues will understand what you did.

3.1 Text files

  1. Convert the file Medical Report Form.doc into a docx-file.
  2. Fill out the form and save or export it as PDF. Make sure that the filename follows the naming convention (e.g. YYYY-MM-DD_Medical-Report-Form; see the last session: http://crc1382.org/rdm-docs/02_data-organization.html)
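If you name files from a script rather than by hand, the YYYY-MM-DD_Title pattern can be generated consistently. This is a small sketch; the helper name dated_filename is made up for illustration, and replacing spaces with hyphens is an assumption based on the example name above.

```python
from datetime import date


def dated_filename(title, extension, day=None):
    """Build a name such as 2021-06-09_Medical-Report-Form.pdf."""
    day = day or date.today()
    # Replace runs of whitespace with single hyphens to avoid spaces in the name.
    slug = "-".join(title.split())
    return f"{day.isoformat()}_{slug}.{extension.lstrip('.')}"
```

Calling dated_filename("Medical Report Form", "pdf") uses today's date. Pass a date object as the third argument to fix the date explicitly.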

3.2 Spreadsheets

  1. Open the file encounter.csv with Excel/LibreOffice (or the spreadsheet program of your choice). Make sure that the file is loaded correctly with the proper encoding and column separator. (If you have no clue how to do that, have a look at the section Useful links.)
  2. Get the timestamp value of cell G2808.
  3. Get the sum of the columns Q to X.
  4. Color the column lab_results_count (R) relative to its value (0 = red; the higher the number, the greener the cell).
  5. Insert a new column (name it horizontal_count_sum) after column immunization_count (X). In this column calculate the sum of the columns Q to X (horizontally).
  6. Save this file as xlsx and also as csv with tabs as delimiters. Name the files properly.
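The same steps can also be scripted, which documents them at the same time. The sketch below uses only Python's standard csv module. It reads a delimited file with an explicit encoding and delimiter, totals the chosen columns, adds horizontal_count_sum after the last of them, and writes a tab-separated copy. Several things here are assumptions: the function name add_horizontal_sum, reading step 3 as "one total per column", treating empty cells as 0, and the default delimiter and encoding, which you need to check against the actual file.

```python
import csv


def add_horizontal_sum(src, dst, columns, delimiter=",", encoding="utf-8"):
    """Append a row-wise sum of `columns` and write the result tab-separated.

    Returns a dict with the total of each column in `columns`.
    """
    totals = {name: 0 for name in columns}
    with open(src, newline="", encoding=encoding) as fin, \
            open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin, delimiter=delimiter)
        # Put the new column directly after the last summed column.
        fields = list(reader.fieldnames)
        fields.insert(fields.index(columns[-1]) + 1, "horizontal_count_sum")
        writer = csv.DictWriter(fout, fieldnames=fields, delimiter="\t")
        writer.writeheader()
        for row in reader:
            # Assumption: an empty cell counts as 0.
            values = [int(row[name] or 0) for name in columns]
            for name, value in zip(columns, values):
                totals[name] += value
            row["horizontal_count_sum"] = sum(values)
            writer.writerow(row)
    return totals
```

For the exercise, columns would be the list of header names found in columns Q to X of encounter.csv, ending with immunization_count. The conditional coloring from step 4 is a spreadsheet feature and cannot be stored in a CSV or TSV file.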

3.3 Images

  1. Take a picture of your computer with the files from above open (either as a screenshot or with your cellphone).
  2. Save this picture in Toms-data-sets in the formats jpg and tiff (you might need to convert it).

3.4 Archive and Reuse

  1. Rename the folder Toms-data-sets to e.g. <YOURNAME>-data-sets (replace <YOURNAME> with your name).
  2. Check if the content of the file README.txt is accurate (author’s information, file names etc.). Update this data documentation file.
  3. Make this folder archivable by zipping it (use zip etc.).
  4. Send this file to the members of your breakout-session. You can send it via email or save the file in e.g. Sciebo/OneDrive/SharePoint and share a download link.
  5. You will receive datasets, too. Unzip the received archive and open all the files. Do you run into any errors?
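Zipping and checking an archive can be done with any zip tool. As one possible sketch, Python's standard library covers both directions: shutil.make_archive creates the zip next to the folder, and zipfile.ZipFile.testzip verifies the checksums of a received archive. The function names zip_folder and check_archive are illustrative.

```python
import shutil
import zipfile
from pathlib import Path


def zip_folder(folder):
    """Create <folder>.zip next to the folder and return its path."""
    folder = Path(folder).resolve()
    archive = shutil.make_archive(
        str(folder),                   # archive path without the .zip suffix
        "zip",
        root_dir=str(folder.parent),
        base_dir=folder.name,          # keep the folder name inside the archive
    )
    return Path(archive)


def check_archive(archive):
    """Verify all members of a zip file and return their names."""
    with zipfile.ZipFile(archive) as zf:
        bad = zf.testzip()  # name of the first corrupt member, or None
        if bad is not None:
            raise ValueError(f"corrupt member: {bad}")
        return sorted(zf.namelist())
```

A passing check only shows that the archive is intact. Whether every file inside opens correctly on your colleague's computer still has to be tested by hand, which is the point of step 5.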

4 Useful links

In this section you will find some links that might be helpful for this topic.

4.1 Conversion tools

Several conversion tools are available online. As always, check the terms of use and what (personal) data the provider will store.

4.2 Tips and Tricks

Date: 2021-06-09

Author: Lukas C. Bossert

Created: 2021-06-09 Wed 13:02
