Hands On RDM
Spreadsheets: Unification, Standardization and Documentation
You find the notepad of today’s session:
http://www.crc1382.org/pad04

1 Topic

1.1 Warm up: Mentimeter questions

2 HandsOn I: Analyze

2.1 Task: Have a look at a spreadsheet (10 min)

Download the spreadsheet Sample-Data-Sheet-v01.xlsx from: https://rwth-aachen.sciebo.de/s/zfeTG4KQ6OlTAzJ

All information known is that this is the data set of a study in which patients are taken blood from on several days. Furthermore it is stated whether they had fever, ARDS etc.

However there are some things odd and problematic in this data sheet…

  1. There are several issues with this data set. Discuss the problems in your group.
  2. Identify at least five problems.
  3. Nominate one spokesperson of your group to preset the problems you found.

3 HandsOn II: Unify and Standardize

Keep in mind two keywords: Unification and Standardization.

3.1 Unification

  • One word - one meaning (e.g. female instead of weiblich, w, F, Female, XX etc.)
  • Use controlled vocabulary / dropdown list: For each cell only specific values are allowed which are defined up front. (e.g. valid values are only female, male, divers, null)

3.2 Standardization

3.3 Reason

  • Improve your data quality.
  • Make your data readable
  • Make your data interoperable

3.4 Task: Reorganize and clean the data set (30 min).

For example:

  • Harmonize the font (e.g. you can use a typewriter font)
  • Color the value of columns according to their value
  • use one date format (e.g. ISO-8601: YYYY-MM-DD)
  • introduce controlled vocabulary / dropdown list (e.g.
    • sex (male, female);
    • ARDS (true, false),
    • Sample_handling (fresh, frozen)); see Useful links.

4 HandsOn III: Document

Documentation is an important feature to understand data.

4.1 Data dictionary

  • A data dictionary is an explaination sheet to your data.
  • Each parameter (column) of your data spreadsheet needs to be explained.
column (former column name) description origin of value controlled vocabulary allowed values
ID Patientennummer Unique identifier for patients clinic no integer
date-birth Geburtstag des Patienten Date of birth of patient in study clinic no date, YYYY-MM-DD
COVIDinfection Covid-Erkrankung Check whether patient was infected by Covid19 Biobank yes true, false
sample-condition Probenaggregatzustand Condition of a sample before analysing it laboratory yes liquid, frozen, smoked

4.2 Task: Create a data dictionary (20 min)

  • Create a new spread sheet
  • Document your data by explaining the columns and its allowed values

5 Useful links

5.2 Cheat sheet

If you have no idea about the tasks in the breakout rooms, ask your host (Lukas).

You can also have a look at Newsletter No. 10 at CRC1382

Date: 2021-07-07

Author: Lukas C. Bossert

Created: 2021-07-07 Wed 12:04

Validate