1. Data science and big data, 10.03.2021 10-12 am
Parallel to the digital transformation, a novel scientific discipline has emerged: data science. Data science enables new approaches to interdisciplinary (big) data analysis through complex algorithms and artificial intelligence (machine learning, deep learning, etc.). Such approaches can extract information from data sets that goes beyond current scientific knowledge. Data science is therefore of interest to nearly all fields of research and industry and is often described as a novel key discipline (e.g. Society of Informatics e.V., 2019). This course provides a basic overview of data science applications.
Producing reliable data science results requires profound knowledge of data analysis methods, data management techniques and innovative technologies. In addition, assessing these results and approaches demands an awareness of their ethical, legal and social implications (all topics are addressed in the following courses and operator tracks).
1. History (timeline comparison with CPU power and storage costs) & clarification of terms
- Statistics -> Machine Learning -> Deep Learning
- Data Mining -> Big Data
- Machine Learning vs. Artificial Intelligence
2. What is Data Science?
- Collection -> Analysis -> Visualization
- Machine Learning
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
- Big Data (data science with huge datasets that require more memory than a single PC provides)
- Languages (e.g. Python, R)
Basic overview of data science applications, methods, terms, tools and big data.
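The difference between supervised and unsupervised learning named in the outline above can be made concrete with a toy sketch. The data and all function names below are made up for illustration. The supervised part learns from labeled examples, while the unsupervised part only sees the raw values and has to find the grouping itself.

```python
from statistics import mean

# Hypothetical toy data: body heights in cm with known labels.
heights = [150.0, 152.0, 155.0, 180.0, 183.0, 185.0]
labels = ["short", "short", "short", "tall", "tall", "tall"]


def fit_centroids(values, value_labels):
    """Supervised: learn one mean (centroid) per known label."""
    groups = {}
    for value, label in zip(values, value_labels):
        groups.setdefault(label, []).append(value)
    return {label: mean(members) for label, members in groups.items()}


def predict(centroids, value):
    """Assign the label whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def two_means(values, iterations=10):
    """Unsupervised: split unlabeled values into two clusters (1-D k-means, k=2).

    Assumes the values are not all identical.
    """
    low, high = min(values), max(values)  # simple initial centroids
    for _ in range(iterations):
        cluster_low = [v for v in values if abs(v - low) <= abs(v - high)]
        cluster_high = [v for v in values if abs(v - low) > abs(v - high)]
        low, high = mean(cluster_low), mean(cluster_high)
    return cluster_low, cluster_high


centroids = fit_centroids(heights, labels)
print(predict(centroids, 178.0))  # uses the labels seen during fitting
print(two_means(heights))         # never sees any labels
```

Reinforcement learning is left out here because it needs an environment that returns rewards, which does not fit into a few lines.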
2. Philosophical reflections on data science, 17.03.2021 10-12 am
A critical awareness (“Critical Thinking”, see below) is crucial for an appropriate and reasonable assessment of data preparation, sharing and utilization in the context of research data management, data protection and data science applications.
Furthermore, critical thinking establishes a common language across disciplines that is aware of limits and difficulties and is therefore essential for cooperative and future-oriented research.
Philosophy is often about “big concepts”: concepts such as knowledge, understanding, autonomy, transparency, intelligence, and creativity. All these concepts are at stake in the context of current research in data science and artificial intelligence. It seems inescapable that we lose some of our own autonomy once our cars start driving autonomously and our houses become smarter and smarter. Computers have outsmarted us in number crunching for decades, but will they also outsmart us in creativity? Will they become the “better scientists”, or will there always remain a difference between “pure prediction” and “real understanding”? Is predictive success acceptable even if it comes with a loss in transparency? After all, transparency is something we are very much worried about, not only in science but in all kinds of political and societal contexts. At the same time, privacy and data protection laws are a major theme in public discourse as well. Consider tracking apps, for instance: do we really want to become transparent citizens and consumers, X-rayed as it were by a machine learning algorithm no one might actually understand?
Critical Thinking: Critical reflection of own research/work and development of empathy for other disciplines, their mindsets and ways to think.
- Schneider, P., et al. (2020). Rethinking drug design in the artificial intelligence era. Nature Reviews Drug Discovery 19, 353–364. doi:10.1038/s41573-019-0050-3
- Boden, M. A. (1998). Creativity and artificial intelligence. Artificial Intelligence 103(1): 347-356
- Burge, T. (1998). Computer Proof, Apriori Knowledge, and Other Minds. Noûs, 32: 1-37. https://doi.org/10.1111/0029-4624.32.s12.1
- Iten, R., et al. (2020). Discovering Physical Concepts with Neural Networks. Physical Review Letters 124(1): 010508
3. Data and information management, 14.04.2021 9-12 am
Comprehensive management of research data is part of every research project and belongs to good scientific practice. It accompanies each phase of a research project, from the proposal phase via data acquisition and data analysis to the publication phase. The overall goal of research data management is the production of findable (F), accessible (A), interoperable (I) and reusable (R) data sets, in short FAIR data sets.
Good stewardship of data (following the FAIR principles; Wilkinson et al., 2016) and an open data culture (Nosek et al., 2015) foster reproducibility as well as sustainability in science and form the foundation for data science applications.
- Research data: Data life cycle and accompanied challenges
- Data management plans (DMP)
- FAIR data principle
- Metadata: standardization and its significance
- Archiving, publication and citation of research data sets
- An understanding of the significance of research data management and an overview of concepts and approaches
- Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3, 160018 (2016).
- Wilkinson, M. D. et al. Comment: A design framework and exemplar metrics for FAIRness. Sci. Data 5, 1–4 (2018).
- Hodson, S. et al. Turning FAIR data into reality: interim report from the European Commission Expert Group on FAIR data (Version Interim draft). Interim Rep. from Eur. Comm. Expert Gr. FAIR data (2018). https://doi.org/10.5281/zenodo.1285272
- Collins, S. et al. FAIR Data Action Plan. Interim Recomm. actions from Eur. Comm. Expert Gr. FAIR data 1–21 (2018). https://doi.org/10.5281/zenodo.1285290
- Wilkinson, M. D. et al. Interoperability and FAIRness through a novel combination of Web technologies. PeerJ Comput. Sci. 3, e110 (2017).
- Mons, B. et al. Cloudy, increasingly FAIR; Revisiting the FAIR Data guiding principles for the European Open Science Cloud. Inf. Serv. Use 37, 49–56 (2017).
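The role of standardized metadata for findable and reusable data sets, as listed in the topics of this session, can be illustrated with a small completeness check. This is only a sketch: the field names in `REQUIRED_FIELDS` and the example record are hypothetical and do not follow any particular metadata standard.

```python
import json

# Assumed minimal set of descriptive fields (hypothetical, for illustration only).
REQUIRED_FIELDS = ("identifier", "title", "creators", "publication_year", "license")


def missing_fields(record):
    """Return the required fields that are absent or empty in a metadata record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


record = {
    "identifier": "doi:10.1234/example",  # placeholder, not a real DOI
    "title": "Example survey data set",
    "creators": ["Doe, Jane"],
    "publication_year": 2021,
    "license": "",  # left empty on purpose
}

print(missing_fields(record))        # fields that still need to be filled in
print(json.dumps(record, indent=2))  # machine-readable form of the record
```

A check like this could run before a data set is archived, so that incomplete descriptions are noticed early.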
4. Data protection and licenses, 21.04.2021 2-4 pm
Compliance with legal requirements in the handling of research data is an indispensable requirement for the long-term success of research data management.
Legal framework of research data management with a special focus on questions of copyright law and data protection law.
Acquiring basic legal knowledge of the possibilities and limitations of research data management.
- Christen/Ranbaduge/Schnell, Linking Sensitive Data. Methods and Techniques for Practical Privacy-Preserving Information Sharing, Chapter 2, 2020
- Donnelly/McDonagh, Research, Consent and the GDPR Exemption, European journal of health law, 2019, p. 97
- Ducato, Data protection, scientific research, and the role of information, Computer Law & Security Review, 2020, 105412
6. Managing qualitative data, 6.5.2021 9-12 am
The term “qualitative data” is used to describe a broad variety of heterogeneous data, including various types of text (e.g. transcripts of interviews or observations), audio, video, pictures or material artefacts. From the perspective of “quantitative” research, i.e. the application of statistical methods to standardized numerical data, qualitative materials just seem to be data that need more structure. But qualitative material is a specific type of data that is usually richer, more context-dependent and more sensitive than quantitative data. On the other hand, qualitative data can be fruitfully analyzed with common tools of quantitative inquiry (e.g. text mining). Thus, this lecture addresses both quantitative and qualitative researchers and aims to introduce them to the particular ethical, legal and practical challenges of qualitative materials (e.g. in terms of data protection, informed consent, anonymization, documentation and data sharing) and to outline good practices as well as examples of fruitful data management and analysis.
1. Introducing qualitative data and research
- What is qualitative data? Characteristics and examples
- The great divide? Quantitative and qualitative data and research processes
- Theory, context and data in qualitative inquiry
- Mixing qualitative and quantitative data and research
2. Challenges of managing qualitative data
- Ethical aspects
- Legal aspects
- Practical aspects
3. Managing qualitative data in practice
- Collecting data
- Organizing data
- Transforming data
- Anonymizing data
- Contextualizing data
- Sharing data
4. Mixing qualitative and quantitative data and methods: examples
Basic overview of qualitative data: characteristics, challenges, approaches, benefits.
- Corti, Louise; van den Eynden, Veerle; Bishop, Libby; Woollard, Matthew (2020): Managing and sharing research data: A guide to good practice. 2nd ed. Los Angeles: SAGE Publications.
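The step “Anonymizing data” from the outline of this session can be sketched for interview transcripts as a simple replacement of known names by placeholders. The transcript, the names and the function `pseudonymize` are made up for illustration. The sketch assumes that the list of names to replace is already known.

```python
import re


def pseudonymize(text, names):
    """Replace each known name with a stable placeholder such as [PERSON_1]."""
    for index, name in enumerate(names, start=1):
        pattern = r"\b" + re.escape(name) + r"\b"  # match whole words only
        text = re.sub(pattern, f"[PERSON_{index}]", text)
    return text


transcript = "Anna said that Ben moved to Bremen. Ben agreed with Anna."
print(pseudonymize(transcript, ["Anna", "Ben"]))
```

Note that the place name stays in the text. Such indirect identifiers are one reason why anonymizing qualitative material usually needs manual review in addition to automated replacement.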
7. Statistical thinking, 10.05.2021 10-12 am
Data science approaches are based on statistical and mathematical methods as well as computer science skills. In this context, it is crucial to understand the basic principles of statistical methods. This helps to apply statistical methods adequately and to produce reliable statistical results.
This course provides an introduction to statistical basics and concepts relevant for data science applications. After a brief presentation of the categories of statistics (descriptive, predictive, confirmatory) and their general ideas, selected basic methods will be explained and illustrated by practical examples: the concept of probability, parameter estimation, confidence intervals and hypothesis testing.
A basic understanding of the major statistical principles.
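Parameter estimation, confidence intervals and hypothesis testing, as named in the course description, can be sketched with the Python standard library. The sample values and function names are made up. The sketch uses a normal approximation; for a sample this small, a t-distribution would give a somewhat wider interval.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Hypothetical sample of measurements (made up for illustration).
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9]


def mean_confidence_interval(values, level=0.95):
    """Normal-approximation confidence interval for the mean."""
    estimate = mean(values)                              # point estimate
    standard_error = stdev(values) / sqrt(len(values))
    z = NormalDist().inv_cdf(0.5 + level / 2)            # about 1.96 for 95 %
    return estimate - z * standard_error, estimate + z * standard_error


def z_test_p_value(values, hypothesized_mean):
    """Two-sided p-value for the null hypothesis 'true mean = hypothesized_mean'."""
    standard_error = stdev(values) / sqrt(len(values))
    z = (mean(values) - hypothesized_mean) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


print(mean(sample))
print(mean_confidence_interval(sample))
print(z_test_p_value(sample, 5.5))
```

A small p-value for the hypothesized mean of 5.5 indicates that the sample is hard to reconcile with that value, which matches the fact that 5.5 lies outside the interval.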
- Fahrmeir, Heumann, Künstler, Pigeot, Tutz (2016). Statistik – Der Weg zur Datenanalyse, 8. Auflage, Springer-Verlag, Berlin, Heidelberg.
- Fahrmeir, Künstler, Pigeot, Tutz, Caputo, Lang (2009). Arbeitsbuch Statistik, 5. Auflage, Springer-Verlag, Berlin, Heidelberg.
- Freedman, Pisani, Purves (1998). Statistics, 3rd edition, W.W. Norton and Company, New York.
- Spiegelhalter (2019). The Art of Statistics: Learning from Data, Pelican, London.
8. Asking the right research questions in data science, 18.05.2021 9-12 am
“An approximate answer to the right question is worth a great deal more than a precise answer to the wrong question” said the renowned statistician John Tukey as early as 1969.
Based on my own experience in statistical consultations, much confusion arises from a mismatch between the research question and the data or methods. Even more fundamentally, the research question is often not clearly articulated at the outset, perhaps because researchers anticipate that the right question can only be answered approximately. But how can we discuss which data and methods are suitable if we are unclear or vague about the question to be answered? It seems that now, in the era of big data, characterised by an abundance of data and a similar abundance of methods for analysing them, the issue of asking the right question takes on a new urgency.
In this course we will discuss the different types of research questions one might face in a variety of applied fields within data science, such as psychology, epidemiology, genetics, or political & social sciences. Key distinctions concern questions that are (i) descriptive, (ii) predictive, or (iii) causal (i.e. about counterfactual prediction). We will consider how these types of research questions are interrelated with the choices / requirements of data, methods of analysis, and the need for more or less specific subject matter background knowledge. We will see how starting with a clear and explicit research question helps with assessing, and maybe avoiding, potential sources of (structural) bias in answering that research question.
Key topics that will be covered:
- Types of research questions (descriptive, predictive, causal/counterfactual)
- Issues of validity and structural bias (e.g. selection, confounding, ascertainment)
- The target trial principle
Upon completion, participants of the course will be able to
- categorise research questions as descriptive, predictive or causal
- elicit a research question by formulating a target trial
- determine implications for the required data and choice of appropriate methods
- identify possible threats to validity / sources of structural bias.
Some prior exposure to or experience with analysing data will be helpful.
- Miguel A. Hernán, John Hsu & Brian Healy (2019) A Second Chance to Get Causal Inference Right: A Classification of Data Science Tasks, CHANCE, 32:1, 42-49.
- Miguel A. Hernán, James M. Robins, Using Big Data to Emulate a Target Trial When a Randomized Trial Is Not Available, American Journal of Epidemiology, Volume 183, Issue 8, 15 April 2016, Pages 758–764.
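The difference between a descriptive and a causal question, and the way confounding can bias the answer, can be illustrated with made-up counts. In this sketch, severity of illness influences both who gets treated and who recovers. All numbers and function names are hypothetical, and the adjustment assumes that severity is the only confounder.

```python
# Hypothetical counts: (severity, treated) -> (recovered, total).
counts = {
    ("mild", True): (9, 10),
    ("mild", False): (72, 90),
    ("severe", True): (45, 90),
    ("severe", False): (3, 10),
}


def crude_difference(table):
    """Recovery rate of treated minus untreated, ignoring severity."""
    def pooled_rate(treated):
        recovered = sum(r for (_s, t), (r, _n) in table.items() if t == treated)
        total = sum(n for (_s, t), (_r, n) in table.items() if t == treated)
        return recovered / total
    return pooled_rate(True) - pooled_rate(False)


def standardized_difference(table):
    """Average of the within-stratum differences, weighted by stratum size."""
    strata = sorted({severity for severity, _t in table})
    grand_total = sum(n for _r, n in table.values())
    result = 0.0
    for severity in strata:
        r1, n1 = table[(severity, True)]
        r0, n0 = table[(severity, False)]
        weight = (n1 + n0) / grand_total
        result += weight * (r1 / n1 - r0 / n0)
    return result


print(crude_difference(counts))         # descriptive comparison
print(standardized_difference(counts))  # adjusted for severity
```

In these made-up data the crude comparison makes the treatment look harmful, while the comparison within severity groups points the other way. Which of the two numbers is the relevant one depends on the research question.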
9. About the meaningfulness of data, 02.06.2021 10-12 am
Data are not, as the etymology suggests, “the given”; they are generated, constructed, made, and sometimes “set up” in the bad sense of the phrase. There is, of course, a (hopefully) large part of them that originates from the subject or phenomenon under scrutiny, and that is what you are after: the true score, if you embrace this concept. There is also, however, a substantial share that is not merely due to random “error” but is added systematically, by you, as a result of decisions you make in the process of obtaining and processing data. So your data may be technically clean, but epistemologically things are far more complicated. It is common to consider data as answers, but we know that the question determines, to a degree, the answer. As a consequence, when it comes to “meaning”, it is essential to reflect on the nature of the questions and to look at what drives those proposing them, e.g. in terms of a paradigm. This is why, when you want to become a scientist, you are taught how to ask good questions, and here “good” does not pertain to being sensible in a content-related, intellectual manner but to the way these questions are set up: on sound theoretical grounds, abiding by the rules of logic, targeting precise hypotheses, and including all relevant parameters (and relevance is a major issue). So much for theory, and off you go, further down the rocky road of practical research, deciding on hundreds of options regarding design logic, reliable and valid measurement, sampling, coding, preprocessing of data, choice of analytical models and implementation tools, and finally, interpretation of results with reference back to question and theory. Some of your decisions are accounted for in your study protocol, and some are not. There may be some that you are not even aware of, and a psychologist will tell you that there is sort of a purpose to unawareness.
This session is titled “the meaningfulness of data”. We need to discuss how meaning relates to data, or to the results of data analysis. For a start, let’s assume that there is no meaning IN the data, but that meaning happens to data; it is attached to it. In fact, YOU attach it, and therefore you must assume liability for it in both the scientific and the legal sense.
- Working definitions: data, meaning, and models.
- Central thesis: meaning is not in facts, but in human reasoning.
- Working backwards: statistics.
- Measurement: how to translate meaning in phenomena into numbers.
- Modeling: how to reduce meaning into models for information.
- Norming: how to scale and compare meaning using data.
- Caveat: no data, or how to retrieve meaning from essentially nothing.
- Caveat: outliers, or how to decide on who’s hot and who’s not in meaning.
- Caveat: graphics, or how to put meaning in the eye of the beholder (or not).
4. Meaning and liability
Since, as a psychologist and statistician, I cannot claim expertise in your respective field of work, I will not, and cannot, tell you how to “do it right”. But the patterns behind “doing it wrong” are quite universal: a moody remark that is warranted by 25 years of statistical consulting. My aim is to create awareness, make the implicit explicit, and foster a critical mindset when it comes to relating data and meaning in your specific discipline. You are welcome to bring along your own doubts and questions, or stories of big misunderstandings of data.
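The caveat on outliers and the topic of norming from the outline above can be illustrated with a few made-up numbers. The sketch shows how a single extreme value changes the mean but hardly the median, and how z-scores put values on a common scale. Whether the extreme value is an error or the most interesting observation is a decision the analyst has to make and justify.

```python
from statistics import mean, median, stdev

# Hypothetical reaction times in ms; the last value is a suspected outlier.
times = [310, 295, 320, 305, 300, 1500]


def z_scores(values):
    """Scale values to mean 0 and (sample) standard deviation 1."""
    center, spread = mean(values), stdev(values)
    return [(value - center) / spread for value in values]


print(mean(times), median(times))            # with the extreme value
print(mean(times[:-1]), median(times[:-1]))  # without it
print([round(z, 2) for z in z_scores(times)])
```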
10. Computer sciences basics for data science, 08.06.2021 10-12 am
Computer science is a key component of data science applications and research data management, as their methods and procedures rely on it. For instance, to enable fast access to information, data sets must be stored efficiently in data structures. Clever modelling and algorithmic processing then enable fast search and selection of information even in big data sets. This course provides insights into computer science basics and gives an overview of topics relevant to data science.
- Computer science and its subdisciplines: applied, technical, practical, theoretical
- Programming languages
- Data storage and processing
- Data structures
- Example: Sorting (Bubble Sort, Merge Sort, Quicksort)
Basic overview of computer science and its subdisciplines; basics in system engineering.
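The sorting example named in the outline (Bubble Sort, Merge Sort, Quicksort) can be sketched as follows. These are minimal teaching versions that return a new list; they are not tuned for speed, and the quicksort variant shown here builds new lists instead of sorting in place.

```python
def bubble_sort(items):
    """Repeatedly swap adjacent out-of-order pairs; O(n^2) comparisons."""
    result = list(items)
    for end in range(len(result) - 1, 0, -1):
        swapped = False
        for i in range(end):
            if result[i] > result[i + 1]:
                result[i], result[i + 1] = result[i + 1], result[i]
                swapped = True
        if not swapped:  # no swap in a full pass: already sorted
            break
    return result


def merge_sort(items):
    """Split in half, sort both halves recursively, merge; O(n log n)."""
    if len(items) <= 1:
        return list(items)
    middle = len(items) // 2
    left, right = merge_sort(items[:middle]), merge_sort(items[middle:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # at most one of these two
    merged.extend(right[j:])  # still has elements left
    return merged


def quicksort(items):
    """Partition around a pivot, then sort the parts; O(n log n) on average."""
    if len(items) <= 1:
        return list(items)
    pivot = items[len(items) // 2]
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    return quicksort(smaller) + equal + quicksort(larger)


data = [5, 2, 9, 1, 5, 6]
print(bubble_sort(data), merge_sort(data), quicksort(data))
```

For small inputs all three behave alike. The difference in the number of comparisons only becomes noticeable as the data set grows.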
11. Programming languages, 16.06.2021 10-12 am
Programming is the essential tool for managing data sets and applying data science methods. It is crucial for
- documentation work
- data preparation
- quality control of data sets
- data analyses
- transforming data into graphics
and makes handling of even big data sets possible.
- What actually is a programming language? What characterizes a programming language and what is it for?
- Why is HTML not a programming language and what does Turing have to do with it?
Approximately 700 programming languages exist, so how can one keep an overview? We learn to distinguish languages by their degree of abstraction, their programming paradigm (imperative, procedural, object-oriented, functional, logical, …) or their area of application. Further, we talk about which programming languages you should know, and some of them are briefly presented in this course.
Overview of programming languages, their features, significance and criteria for distinction.
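The programming paradigms mentioned above can be compared by solving the same small task three times in one language. The task (sum of the squares of all even numbers) and all names are made up for illustration. Python is used only because it supports several paradigms; other languages enforce one style more strictly.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5, 6]


def sum_even_squares_imperative(values):
    """Imperative/procedural: step-by-step instructions that change a variable."""
    total = 0
    for value in values:
        if value % 2 == 0:
            total += value * value
    return total


def sum_even_squares_functional(values):
    """Functional: compose functions, no variable is reassigned."""
    evens = filter(lambda value: value % 2 == 0, values)
    return reduce(lambda acc, value: acc + value * value, evens, 0)


class EvenSquareSummer:
    """Object-oriented: state and behaviour are bundled in an object."""

    def __init__(self):
        self.total = 0

    def add(self, value):
        if value % 2 == 0:
            self.total += value * value
        return self  # allows chaining of calls


summer = EvenSquareSummer()
for number in numbers:
    summer.add(number)

print(sum_even_squares_imperative(numbers))
print(sum_even_squares_functional(numbers))
print(summer.total)
```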
12. Cryptography basics, 22.06.2021 10-12 am
Cryptography is the key technology to ensure the security and privacy of IT systems. Understanding the basic principles of cryptographic functions is an indispensable prerequisite for the development of modern IT systems.
This course will provide elementary knowledge of cryptography (in theory and practice), for example: asymmetric vs. symmetric encryption, cryptographic hash functions, digital signatures and public-key infrastructures, and post-quantum cryptography.
Basic knowledge of cryptography, which in particular allows participants to assess the strength of cryptographic methods in practice.
- C. Paar, J. Pelzl: Understanding Cryptography
- A. J. Menezes et al.: Handbook of Applied Cryptography
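Two of the building blocks named in this session, cryptographic hash functions and symmetric keys, can be tried out with the Python standard library. The messages and the helper `verify` are made up for illustration. The sketch uses SHA-256 and HMAC; asymmetric techniques such as digital signatures are not part of the standard library and are left out here.

```python
import hashlib
import hmac
import secrets

message = b"research data, version 1"

# Hash function: a fixed-length fingerprint that changes with any modification.
digest = hashlib.sha256(message).hexdigest()
other_digest = hashlib.sha256(b"research data, version 2").hexdigest()

# Symmetric setting: sender and receiver share the same secret key.
key = secrets.token_bytes(32)
tag = hmac.new(key, message, hashlib.sha256).digest()


def verify(shared_key, received_message, received_tag):
    """Check that the message was not changed and the tag was made with the key."""
    expected = hmac.new(shared_key, received_message, hashlib.sha256).digest()
    return hmac.compare_digest(expected, received_tag)  # constant-time comparison


print(digest)
print(digest == other_digest)
print(verify(key, message, tag))
print(verify(key, b"tampered message", tag))
```

A plain hash only detects accidental changes, because anyone can recompute it. The keyed tag additionally shows that the sender knew the shared secret.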
13. Security & Privacy, 29.06.2021 10-12 am
Security and privacy are key aspects in developing and maintaining trustworthy systems. A lack of security results in vulnerable systems that are exposed unprotected to potential attackers and pose an incalculable economic and personal risk. As personal data has become the new currency in the digital era, its protection from unauthorized processing and distribution is a key issue to preserve the privacy and self-determination of individuals.
Techniques to measure and enhance the security and privacy of IT systems
- Security: security protocols, security policies and their enforcement
(e.g. access-control, dataflow control)
- Privacy: GDPR, privacy-enhancing techniques
(e.g. differential privacy, k-anonymity)
This course provides basic knowledge in security and privacy techniques and sketches their underlying foundations.
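The two privacy-enhancing techniques named in the outline, k-anonymity and differential privacy, can be sketched in a few lines. The records, column names and function names are hypothetical. The first function measures k for a table whose quasi-identifiers have already been generalized. The second adds Laplace noise to a counting query, which is one common way to obtain differential privacy for counts.

```python
import random
from collections import Counter

# Hypothetical records with generalized quasi-identifiers (zip code, age group).
records = [
    {"zip": "281**", "age": "20-29", "diagnosis": "A"},
    {"zip": "281**", "age": "20-29", "diagnosis": "B"},
    {"zip": "281**", "age": "30-39", "diagnosis": "A"},
    {"zip": "281**", "age": "30-39", "diagnosis": "C"},
    {"zip": "283**", "age": "30-39", "diagnosis": "B"},
]


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


def laplace_noise(scale, rng=random):
    """Laplace(0, scale) sample, drawn as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(rows, predicate, epsilon, rng=random):
    """Noisy count; a counting query changes by at most 1 per person."""
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(1 / epsilon, rng)


print(k_anonymity(records, ["zip", "age"]))      # last row is unique
print(k_anonymity(records[:4], ["zip", "age"]))  # without the unique row
print(private_count(records, lambda row: row["diagnosis"] == "A", epsilon=0.5))
```

A smaller `epsilon` means more noise and stronger protection, at the price of less accurate answers.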