Hands On RDM
Spreadsheets: Unification, Standardization and Documentation
The notepad for today's session is available at:
http://www.crc1382.org/pad04
2 HandsOn I: Analyze
2.1 Task: Have a look at a spreadsheet (10 min)
Download the spreadsheet Sample-Data-Sheet-v01.xlsx from:
https://rwth-aachen.sciebo.de/s/zfeTG4KQ6OlTAzJ
All that is known is that this is the data set of a study in which blood was taken from patients on several days.
It also records whether they had fever, ARDS, etc.
However, several things in this data sheet are odd and problematic…
- There are several issues with this data set. Discuss the problems in your group.
- Identify at least five problems.
- Nominate one spokesperson of your group to present the problems you found.
3 HandsOn II: Unify and Standardize
Keep in mind two keywords: Unification and Standardization.
3.1 Unification
- One word - one meaning (e.g. `female` instead of `weiblich`, `w`, `F`, `Female`, `XX`, etc.)
- Use controlled vocabulary / dropdown list: For each cell only specific values are allowed which are defined up front. (e.g. valid values are only `female`, `male`, `divers`, `null`)
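As a minimal sketch of what unification means in practice, the following Python snippet maps different spellings onto one controlled vocabulary. The function name `unify_sex` and the synonym table are assumptions made for illustration; the `m` and `männlich` entries in particular are added by analogy and are not part of the sample sheet description.

```python
# Controlled vocabulary as listed above.
ALLOWED_SEX = {"female", "male", "divers", "null"}

# Hypothetical synonym table: observed spelling (lower case) -> allowed value.
SEX_SYNONYMS = {
    "female": "female",
    "weiblich": "female",
    "w": "female",
    "f": "female",
    "xx": "female",
    "male": "male",
    "m": "male",          # assumed by analogy
    "männlich": "male",   # assumed by analogy
    "divers": "divers",
}


def unify_sex(raw):
    """Return the controlled-vocabulary value for a raw cell entry.

    Empty cells become "null"; unknown spellings raise an error so that
    they are reviewed by a person instead of being guessed.
    """
    if raw is None or str(raw).strip() == "":
        return "null"
    key = str(raw).strip().lower()
    if key not in SEX_SYNONYMS:
        raise ValueError(f"unknown value for sex: {raw!r}")
    return SEX_SYNONYMS[key]
```

Raising an error for unknown spellings, rather than silently writing `null`, keeps unexpected entries visible during cleaning.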
3.2 Standardization
- Use descriptions/names commonly used in your research field (see Useful links); e.g. abbreviations of the blood count (Hb, WBC); ICD-10 (A00.- = Cholera) etc.
3.3 Reason
- Improve your data quality.
- Make your data readable.
- Make your data interoperable.
3.4 Task: Reorganize and clean the data set (30 min).
For example:
- Harmonize the font (e.g. you can use a typewriter font)
- Color the cells of each column according to their value
- Use one date format (e.g. ISO-8601: `YYYY-MM-DD`)
- Introduce controlled vocabulary / dropdown list (e.g. `sex` (`male`, `female`); `ARDS` (`true`, `false`); `Sample_handling` (`fresh`, `frozen`)); see Useful links.
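If you prefer to script the date clean-up, a sketch in Python could look like the following. The list of input formats in `KNOWN_FORMATS` and the function name `to_iso_date` are assumptions; check which formats actually occur in the sheet before relying on it.

```python
from datetime import datetime

# Assumed input formats; the first one that parses wins.
KNOWN_FORMATS = (
    "%Y-%m-%d",  # already ISO 8601
    "%d.%m.%Y",  # German style, e.g. 03.11.2021
    "%d/%m/%Y",  # day first, e.g. 03/11/2021
)


def to_iso_date(raw):
    """Convert a date string in one of KNOWN_FORMATS to YYYY-MM-DD."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    raise ValueError(f"unrecognized date format: {raw!r}")
```

Note that a value such as `03/11/2021` is ambiguous (day first or month first); the sketch assumes day first, which has to be confirmed with whoever entered the data.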
4 HandsOn III: Document
Documentation is essential for understanding data.
4.1 Data dictionary
- A data dictionary is an explanation sheet for your data.
- Each parameter (column) of your data spreadsheet needs to be explained.
column | (former column name) | description | origin of value | controlled vocabulary | allowed values |
---|---|---|---|---|---|
ID | Patientennummer | Unique identifier for patients | clinic | no | integer |
date-birth | Geburtstag des Patienten | Date of birth of patient in study | clinic | no | date, YYYY-MM-DD |
COVIDinfection | Covid-Erkrankung | Whether the patient was infected with COVID-19 | Biobank | yes | true, false |
sample-condition | Probenaggregatzustand | Condition of a sample before analysing it | laboratory | yes | liquid, frozen, smoked |
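A data dictionary can also be used to check data automatically. The following Python sketch encodes the "allowed values" column of the example table and reports every cell of a row that violates it. The names `DATA_DICTIONARY` and `validate_row`, and the representation of a row as a dict of strings, are assumptions made for illustration.

```python
from datetime import datetime


def is_integer(value):
    """True if the value can be read as an integer."""
    try:
        int(value)
        return True
    except (TypeError, ValueError):
        return False


def is_iso_date(value):
    """True if the value is a valid date written exactly as YYYY-MM-DD."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except (TypeError, ValueError):
        return False
    return len(value) == 10  # reject unpadded forms such as 1980-5-2


# Column name -> rule; a rule is either a check function or a set of allowed values.
DATA_DICTIONARY = {
    "ID": is_integer,
    "date-birth": is_iso_date,
    "COVIDinfection": {"true", "false"},
    "sample-condition": {"liquid", "frozen", "smoked"},
}


def validate_row(row):
    """Return a list of messages, one per column that breaks its rule."""
    problems = []
    for column, rule in DATA_DICTIONARY.items():
        value = row.get(column)
        ok = rule(value) if callable(rule) else value in rule
        if not ok:
            problems.append(f"{column}: {value!r} is not allowed")
    return problems
```

For example, `validate_row({"ID": "17", "date-birth": "02.05.1980", "COVIDinfection": "yes", "sample-condition": "frozen"})` reports the `date-birth` and `COVIDinfection` cells, while a row that follows the dictionary yields an empty list.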
4.2 Task: Create a data dictionary (20 min)
- Create a new spreadsheet
- Document your data by explaining the columns and their allowed values
5 Useful links
- Standards and Tools for Data Quality Assessment: https://dataquality.ship-med.uni-greifswald.de/
5.1 Controlled vocabulary (dropdown list)
5.2 Cheat sheet
If you are stuck on the tasks in the breakout rooms, ask your host (Lukas).
You can also have a look at Newsletter No. 10 of CRC1382.