Future Opportunities for Software in Research - 2022

Europe/Berlin
Lecture Hall (Max Planck Institute for Evolutionary Biology)

August Thienemann Strasse 2 24306 Plön Germany
Description

Future Opportunities for Software in Research

Details

  • Date: 12th and 13th May 2022
  • Venue: hybrid, in Plön and online
  • Registration: Will open on 7th March 2022; limit for on-site attendance: max. 60 persons
  • Participation open to all academic and research institutions
  • Contribution Types:
    • lightning talks
    • tutorials
    • presentations / lectures

Call for Participation

Software literacy has become a key competence for scientists across all disciplines. Scientists use software daily, and software development is becoming an increasingly important component of scientific productivity. However, the software needed for certain research projects can become highly complex and take up resources otherwise needed for core research. In response to the demand for professionalization of software development in research, specialized Research Software Engineers have emerged in recent years. With their help, researchers tackle challenges in the areas of software and data, such as reproducibility, correctness, user-friendliness, performance, and maintenance.
Our two-day workshop provides new opportunities for learning about best practices in scientific software development, such as:

  • Seeing recent flagship projects in action.
  • Discussing software licensing and intellectual property issues.
  • Discovering new ways to make your software known and recognized.

We invite all interested scientists, research software engineers, IT and computing specialists and individuals involved in creating, using or otherwise dealing with research software in the Max Planck Society. We also welcome participants from other research institutions.

Timeline and Location

  • February 28 - Deadline for contributions
  • March 7 - Registration opens (at most 60 on-site participants)
  • April 25 - Registration for in-person attendance closes
  • May 12 - Afternoon, main focus: research software applications
  • May 13 - Morning, main focus: research software engineering

The workshop will be held in hybrid form, so you will be able to connect remotely or come to the Max Planck Institute for Evolutionary Biology in Plön (https://www.evolbio.mpg.de).

COVID-19 regulations

The organization committee observes and acts on the current and changing COVID-19 regulations defined by the regional and federal authorities. At the time of writing, this means that 2G regulations apply: in-person participants must be vaccinated against, or have recently recovered from, a SARS-CoV-2 infection. We recommend taking a coronavirus self-test immediately before travelling or on the morning of the first day. Last-minute changes in these regulations are possible and may (in the worst case) lead to late cancellation of the event.

Themes and types of contribution

We are looking for prime examples of software-enabled research (day 1) and of working practices in research software engineering (day 2).

Who should submit?

We welcome submissions from anyone who has an interesting take on research software development, especially:

  • Researchers at any career stage who develop software for research purposes
  • Software developers working in a research context, whatever their job title or field may be
  • Those interested in advancing the understanding of how best to use and maintain research software, e.g. concerning openness, reproducibility, sustainability, scalability, or performance
  • Organizations providing tools, platforms, or services that foster research software, such as IT infrastructure providers or computing and data centres.

Formats

This workshop will only feature oral presentations (no posters).

Talks may be 15-30 minutes long, plus variable time for Q&A and discussion. If your talk is accepted, the session chair will notify you of the talk length.

Accommodation

Here are some recommendations for hotels in Plön:

Holiday apartments can be found here, for example: https://www.holsteinischeschweiz.de/hotels-fewo/accommodations

Organisation Team

  • Holger Dinkel (Max Planck Institute for Biological Cybernetics)
  • Conrad Droste (Max Planck Institute for Marine Microbiology)
  • Carsten Fortmann-Grote (Max Planck Institute for Evolutionary Biology)
  • Michael Franke (Max Planck Digital Library)
  • Maximilian Funk (General Headquarter of the Max Planck Society)
  • Yves Vincent Grossmann (Max Planck Digital Library)
  • Stephan Janosch (Max Planck Institute of Molecular Cell Biology and Genetics)
  • Sven Willner (Max Planck Institute for Meteorology)

 Local Organisation

  • Nikoleta Glynatsi
  • Carsten Fortmann-Grote
  • Maren Lehmann
  • Britta Baron

Contact

You can reach the organisers via email to research-software-workshop@lists.mpg.de

    • Welcome
    • Session 1: Scientific Highlights I (Chair: Conrad Droste)

      • 1
        Diamond
        Speakers: Benjamin Buchfink (MPI Biology), Klaus Reuter (MPCDF)
      • 2
        Unravelling the DNA code - An application from Computational Biology in the brain

        SCENIC: https://www.nature.com/articles/nmeth.4463

        Speaker: Sara Aibar (Catholic University of Leuven, VIB Center for Brain & Disease Research)
      • 3
        Code Club at the MPI for Psychiatry

        At the Max Planck Institute of Psychiatry (MPIP) people with different programming skills, from wet-lab scientists to bioinformaticians, work together. Also, quite a lot of PhD students who haven't received a formal education in computer science perform primarily computational work. After a break of more than one and a half years, we revived the Code Club at the MPIP with monthly meetings. There we hold tutorials about programming topics, discuss problems encountered in the last month and find partners for code review. Our main goal is to increase the quality of the research software and scripts developed and used for our research at the MPIP. In the talk, I will give an overview of our motivation and the implementation of the Code Club.

        Speaker: Jonas Hagenberg (MPI of Psychiatry)
      • 4
        DynamicalSystems.jl - Nonlinear dynamics software for everyone

        DynamicalSystems.jl is an award-winning software library for nonlinear
        dynamics and nonlinear timeseries analysis. It was born out of
        frustration when facing two problems: (1) the lack of general-purpose,
        accessible software for nonlinear dynamics for use in the lecture
        hall, and (2) the complete and utter lack of reproducibility of the
        entire field. DynamicalSystems.jl was designed to address both of these
        problems, and also offer much more. It is structured like an
        encyclopaedia of nonlinear dynamics, and was written with clarity of
        source code as the highest priority. By now, it has seen contributions
        by dozens of individuals initially unaffiliated with the library.
        Besides making nonlinear dynamics accessible, this has also enabled
        a brand-new kind of research, born on, and done entirely on, GitHub, an
        open source software collaboration platform. In this presentation I will
        discuss how DynamicalSystems.jl came to be, why it is the first software
        to succeed in what it does, and ultimately, how it can make the field of
        nonlinear dynamics reproducible.

        Speaker: George Datseris (MPI Meteorology)
    • 15:10
      Break
    • Session 2: Scientific Highlights II (Chair: Carsten Fortmann-Grote)
      • 5
        barbac: A versatile tool for quantifying barcoded populations

        Over the last few years, evolutionary biologists have tagged the individuals of bacterial populations, yeast populations, and cancer cells with short sequences to better understand the eco-evolutionary dynamics that shape their biodiversity, as well as the dynamics that govern microbial communities. A crucial technical stepping stone for answering these questions is quantifying these barcoded populations after amplicon sequencing. We have developed barbac, an R package that allows extracting, clustering, and merging barcodes in a time series. Compared to similar pipelines, barbac offers a choice of several clustering methods, graphical visualisations of the metadata, and an interactive format through a Shiny app. Its versatility gives barbac wide applications in bioinformatics and text analysis.

        Speakers: Dr Loukas Theodosiou (Max Planck Institute for Evolutionary Biology), Andrew Farr (Max Planck Institute for Evolutionary Biology), Paul Rainey (Max Planck Institute for Evolutionary Biology)
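
        The abstract above mentions extracting, clustering, and merging barcodes. As a rough illustration of the clustering and merging step only, here is a minimal Python sketch of one common greedy approach (abundant barcodes become cluster centres, rarer ones within one mismatch are merged into them). This is an assumption made for illustration, not barbac's own method or API; barbac is an R package, and the function names and example reads below are hypothetical.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length barcodes."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def cluster_barcodes(reads, max_dist=1):
    """Greedy clustering: abundant barcodes become cluster centres, and rarer
    barcodes within max_dist mismatches are merged into the first such centre."""
    counts = Counter(reads)
    clusters = {}  # centre barcode -> merged read count
    # Visit barcodes from most to least abundant (ties broken alphabetically).
    for barcode, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for centre in clusters:
            if hamming(barcode, centre) <= max_dist:
                clusters[centre] += n
                break
        else:
            # No existing centre is close enough: start a new cluster.
            clusters[barcode] = n
    return clusters


# Hypothetical reads from one time point: two true barcodes plus sequencing errors.
reads = ["ACGT"] * 5 + ["ACGA"] + ["TTGC"] * 3 + ["TTGG"]
merged = cluster_barcodes(reads)
print(merged)
```

        Running the same function once per sampling time point would give the per-cluster counts over a time series; real pipelines typically offer more careful distance measures and several clustering methods.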
      • 6
        Research Software for Historical Language Comparison

        The field of traditional historical linguistics has long applied
        various techniques for the historical comparison of languages which aim
        to reconstruct certain aspects of ancestral languages which are not
        witnessed in sources through the comparison of extant language varieties
        for which sources exist. Although the techniques are in theory highly
        formalized, they are up to now almost exclusively applied in a manual
        fashion and only certain aspects, such as reconstruction of language
        phylogenies have been automated so far. Our project on Computer-Assisted
        Language Comparison tries to design software tools which help linguists
        to automate and formalize these tasks. Our strategy is to design
        complex algorithms which typically run in Python along with lightweight
        annotation tools which help scholars to annotate and analyze their data
        in a human- and machine-readable way. In the talk, we will introduce our
        tools and our strategies for research software design and illustrate
        them by providing concrete examples.

        Speaker: Johann-Mattis List (Max Planck Institute for Evolutionary Anthropology)
      • 7
        Ubermag

        Topics of this contribution:

        • Jupyter notebook as user interface
        • reproducibility, interactive documentation
        • special feature of Ubermag: it provides a domain-specific language (for micromagnetic research) for expressing the problem, which is then automatically translated into configuration files for the simulation engine
        • software engineering
        • GitHub Actions for CI
        • unit tests, system tests
        • testing notebooks through nbval
        • pre-commit hooks
        • software design: connect to existing libraries where possible, such as NumPy, SciPy, Matplotlib, pandas, k3d, and recently xarray
        • packaging: PyPI, conda-forge, Spack?
        Speaker: Hans Fangohr (Max Planck Institute for Structure and Dynamics of Matter)
      • 8
        topwave - A scientific python package for topological properties of spin wave spectra

        Ever since the awarding of the Nobel Prize in Physics to J.M.
        Kosterlitz, D. J. Thouless and D. Haldane in 2016 for their pioneering
        work on topological phase transitions, these exotic phases of matter
        have become one of the most sought-after features in quantum materials.
        They are of particular interest for their unconventional and
        dissipationless transport properties, and are a key ingredient for many
        quantum computer blueprints. Recently, it was realized that not only
        electronic excitation spectra, but also those of magnets can exhibit
        nontrivial topology. The standard procedure for calculating these
        magnetic excitations is Linear Spin Wave Theory (LSWT). Although there
        is a widely used MATLAB library to calculate these spectra, it lacks the
        tools to compute topological invariants and transport properties. My goal
        is to provide an open source Python package to the scientific community
        that is able to calculate the magnetic excitations of any magnetic
        material, and provides the user with the tools to calculate transport
        properties and discover potential topological phases.

        Speaker: Mr Niclas Heinsdorf (Max Planck Institute for Solid-State Research)
      • 9
        Numerical integration on an implicit surface in the Brillouin zone of chiral materials for the calculation of the circular photogalvanic effect

        In this talk I will present the problem of calculating the circular
        photogalvanic effect in topological semimetals[1], a property for
        which we would like to study the effects of symmetry breaking. One
        of the approaches explored in this ongoing project, unlike in
        previous studies, is to evaluate the resulting implicit surface
        integrals directly. Previous studies used an artificial broadening of
        the implicit surface, which converts this problem to a 3D integration.
        It is solved by sampling a grid, which introduces rather high errors
        of up to 2% as the grid density is pushed to numerical limits[2]. We
        will discuss different possible solutions to this problem and the
        subtleties of implementing a numerical implicit surface integration.
        Suggestions on different unexplored approaches to the presented
        problems are welcome.

        [1] F. de Juan, et al. Quantized circular photogalvanic effect in Weyl
        semimetals. Nature communications 8.1 (2017): 1-7.

        [2] Z. Ni, et al. Giant topological longitudinal circular
        photo-galvanic effect in the chiral multifold semimetal CoSi. Nature
        communications 12.1 (2021): 1-8.

        Speaker: Kirill Alpin (MPI for Solid State Research)
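
        To make the broadening approach mentioned in the abstract above concrete, here is a toy Python sketch. It does not use the photogalvanic integrand; as an assumption for illustration it takes the unit sphere as the implicit surface f(k) = |k| - 1 = 0, replaces the delta function by a Gaussian of width sigma, and sums over a uniform 3D grid. All names and parameter values are hypothetical. For this toy case the broadening alone inflates the area by a relative amount of roughly sigma squared, which shows how the artificial width feeds directly into the error.

```python
import math


def gaussian_delta(x: float, sigma: float) -> float:
    """Gaussian of width sigma, a broadened stand-in for the Dirac delta."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def broadened_sphere_area(n: int = 48, sigma: float = 0.1, half_width: float = 1.5) -> float:
    """Estimate the area of the unit sphere f(k) = |k| - 1 = 0 by summing
    delta_sigma(f(k)) * |grad f| over a uniform 3D grid (|grad f| = 1 here)."""
    h = 2.0 * half_width / n
    # Cell-centre coordinates along one axis (midpoint rule).
    axis = [-half_width + (i + 0.5) * h for i in range(n)]
    total = 0.0
    for x in axis:
        for y in axis:
            xy2 = x * x + y * y
            for z in axis:
                f = math.sqrt(xy2 + z * z) - 1.0
                total += gaussian_delta(f, sigma)
    return total * h ** 3


exact = 4.0 * math.pi  # analytic area of the unit sphere
estimate = broadened_sphere_area()
relative_error = (estimate - exact) / exact
print(f"broadened estimate: {estimate:.4f}, exact: {exact:.4f}")
print(f"relative error: {relative_error:.2%}")
```

        Shrinking sigma reduces the bias but requires a finer grid to resolve the narrower Gaussian, so the cost grows quickly in three dimensions; a direct surface integration avoids this trade-off.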
      • 10
        Syncopy - Systems Neuroscience Computing in Python

        Syncopy (www.syncopy.org) aims to be a completely open source, user-friendly yet powerful data analysis suite for the neurosciences. It is developed in Python, makes extensive use of distributed computing via Dask, and achieves low memory footprints by using on-disk HDF5 data structures in the backend by default. For our users, we supply highly abstracted frontend functions, which allow using the same analysis code irrespective of whether it is run on their local machines or on an HPC cluster. We aim to interface with existing data formats (e.g. Neurodata Without Borders, NWB) and community tools (e.g. FieldTrip), and foster reproducibility by creating and preserving extensive meta-information during processing.

        Speaker: Gregor Mönke (Ernst Strüngmann Institute for Neuroscience)
    • 17:25
      Break
    • Session 3: Reproducibility and Sustainable Software Development
      • 11
        Software in Reproducible Science

        Software is both a cause of unreliable research and part of the solution. The bulk of scientific research relies upon specialized software for data management and analysis. The bad news is that much of this software is poorly tested and documented, and researchers often use software in unreliable ways. Part of the problem is that researchers are being asked to perform a job they have not been trained for: software development. The good news is that borrowing simple habits and open tools from software engineering brings huge benefits. Even more good news: Specialized curricula already exist to train scientists to develop and use these habits in their own research.

        Speaker: Richard McElreath (MPI Evolutionary Anthropology)
      • 12
        Automating Reproducibility - Challenges and what it takes to meet them

        Computational reproducibility is a building block for transparent and cumulative science. It enables the
        originator and other researchers, on other computers and later in time, to reproduce and thus understand
        how results came about while avoiding various errors that may lead to erroneous reporting of statistical
        and computational results. But what does it take to make something reproducible? Until recently, detailed
        descriptions of methods and analyses were the primary instrument for ensuring scientific reproducibility.
        Such manual description fails to ensure reproduction for four different reasons that become more likely the
        more central computational methods are to the research. To meet these challenges, we propose that researchers
        take advantage of four technological advancements—version control, dynamic document generation, workflow
        automation, and containerization. Our workflow enables scientists to achieve a more comprehensive standard
        that allows anyone to access a digital research repository and reproduce all computational steps from raw
        data to final report with a single command.

        Speakers: Aaron Peikert (MPIB Berlin), Andreas Brandmaier (MPIB Berlin), Maximilian Ernst (MPIB-Berlin)
      • 13
        Writing Extensible Software for Researchers - Principles and an Example in Julia

        Software for research has to keep up with the methodological developments in its
        field. All too often, only a handful of maintainers bear the load of maintaining and
        extending software. In consequence, they are swamped with demands for adding
        additional features, resulting in long delays until innovations become available.
        However, in many disciplines, methodological researchers are somewhat proficient
        in writing code and would in theory be able to contribute to existing software solutions.
        But often the existing code base is not easily extensible, and researchers instead write
        idiosyncratic ad-hoc software solutions for their specific tasks. In consequence, they
        reimplement large parts of existing software and add their required features, resulting
        in many similar software packages co-existing but not being compatible with each other.
        To solve this problem, it is necessary to design open software that is easily
        extensible by other researchers. I will show that this is not only about making source code
        publicly available, but about design patterns and software development methodologies.
        Many features of the Julia programming language make it ideal to meet these demands,
        while achieving high performance for compute-intensive tasks. In addition, I will show
        everything in action in the experimental Julia package StructuralEquationModels.jl.

        Speakers: Maximilian Ernst (MPIB-Berlin), Aaron Peikert (MPIB Berlin), Andreas Brandmaier (MPIB Berlin)
      • 14
        Advantages of Using GitLab and DevOps for Developing Scientific Software
        Speaker: Christine Muehleib (Fraunhofer UMSICHT)
      • 15
        Accessing HPC resources via RESTful API

        The usual mode of accessing High Performance Computing (HPC) resources involves
        interactively connecting to the command-line interface and submitting job
        scripts to a job scheduler.

        Some services which provide a user interface by themselves (e.g. when working
        with graphical data) or services which require HPC resources as a compute
        backend for an already existing workflow engine, can benefit from the
        integration of HPC resources via an API.

        In our prototype for such a solution, external services authenticate against
        this backend via a RESTful API, which then submits jobs to the HPC system's job
        scheduler on behalf of the user. Moreover, the job status and outcome are tracked
        and can be queried by the service and reflected back to the user.

        We showcase our architecture and prototype implementation as well as two usage
        scenarios, namely GitLab CI/CD workflows requiring an HPC toolchain and HPC
        integration with the workflow engine Flowable.

        Speaker: Christian Köhler (GWDG)
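
        The authenticate, submit, and poll pattern described in the abstract above can be sketched in a few lines of Python. This is an in-memory mock under stated assumptions, not the speaker's prototype: the endpoint paths (/jobs, /jobs/<id>), the token scheme, the job states, and the class names FakeScheduler and ApiBackend are all hypothetical, and no real scheduler or HTTP server is involved.

```python
import itertools
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    owner: str
    script: str
    state: str = "PENDING"
    exit_code: Optional[int] = None


class FakeScheduler:
    """Stand-in for the HPC job scheduler; each tick() advances every job one state."""

    _next_state = {"PENDING": "RUNNING", "RUNNING": "COMPLETED"}

    def __init__(self):
        self.jobs = {}
        self._ids = itertools.count(1)

    def submit(self, owner, script):
        job_id = next(self._ids)
        self.jobs[job_id] = Job(owner, script)
        return job_id

    def tick(self):
        for job in self.jobs.values():
            job.state = self._next_state.get(job.state, job.state)
            if job.state == "COMPLETED":
                job.exit_code = 0


class ApiBackend:
    """Maps REST-style requests onto scheduler calls on behalf of the token's user."""

    def __init__(self, scheduler):
        self.scheduler = scheduler
        self.tokens = {}  # token -> user name

    def issue_token(self, user):
        token = secrets.token_hex(8)
        self.tokens[token] = user
        return token

    def handle(self, method, path, token, body=None):
        """Return (HTTP-like status code, response payload)."""
        user = self.tokens.get(token)
        if user is None:
            return 401, {"error": "invalid token"}
        if method == "POST" and path == "/jobs":
            job_id = self.scheduler.submit(user, body["script"])
            return 201, {"id": job_id}
        if method == "GET" and path.startswith("/jobs/"):
            raw_id = path.rsplit("/", 1)[1]
            job = self.scheduler.jobs.get(int(raw_id)) if raw_id.isdigit() else None
            # Users may only query jobs submitted with their own token.
            if job is None or job.owner != user:
                return 404, {"error": "no such job"}
            return 200, {"state": job.state, "exit_code": job.exit_code}
        return 405, {"error": "unsupported request"}


scheduler = FakeScheduler()
api = ApiBackend(scheduler)
token = api.issue_token("alice")

# An external service submits a job script, then polls its status.
status, created = api.handle("POST", "/jobs", token, {"script": "#!/bin/bash\nhostname\n"})
states = []
for _ in range(3):
    _, info = api.handle("GET", f"/jobs/{created['id']}", token)
    states.append(info["state"])
    scheduler.tick()  # simulated passage of time on the cluster
print(status, states)
```

        Keeping the scheduler behind a narrow submit/query interface is one way such a backend could serve both a CI system and a workflow engine without either needing shell access to the cluster.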
      • 16
        How to make software outlive the research project

        A lot of good software is abandoned once the PhD
        student graduates or the programmer leaves the institute. Also,
        maintaining software doesn’t generate papers, which are the unit used to
        measure scientific prowess. I believe there are a few low-hanging fruits
        that scientific programmers can pick to improve the state of research
        software development.

        Speaker: Fabian Klötzl
      • 17
        Carpentries: The benefits of a collaborative approach to research software skills training
        Speaker: Toby Hodges (Carpentries)
    • Dinner
    • 21:00
      Discussion
    • Session 4: Open Software, Open Data, Open Science
      • 18
        Why open source community building makes your research software better - and what to expect when you get started
        Speaker: Yo Yehudi
      • 19
        What is research software?

        Research software is vital to research.

        In this talk I will discuss what could be meant by the term "research software".

        I will show some examples of research software in the field of game theory (and
        others) but also give an overview of what the published literature says on the topic.

        This will conclude with a proposal for a definition of research software with a
        discussion of the inherent difficulties and opportunities.

        Speaker: Vince Knight (Cardiff University)
      • 20
        Research Software Development at Helmholtz
        Speaker: Uwe Konrad (Helmholtz-Zentrum Dresden-Rossendorf)
      • 21
        GitLab at MPCDF
        Speaker: Raphael Ritz (MPCDF)
      • 22
        How to bring industry standards to your research software development
        Speaker: Jean-Claude Passy
    • 10:35
      Coffee Break
    • Session 5: Challenges and Solutions
      • 23
        OSS in Climate Modelling, Introducing a Software Policy at an MPI

        The climate model "ICON" has been developed at the MPI for Meteorology for climate and
        weather research.
        The model consists of approx. 500k lines of (much legacy) code developed by hundreds of
        people across the world and is under constant change due to porting to the most modern HPC
        architectures. The code is owned by four institutions: MPI-M, DWD, KIT and DKRZ.
        These institutions had been developing the code on the legal basis of cooperation agreements.
        A license specifically drafted by a law firm as "open-source-like" has not proven useful. Among
        the problems with this license were that it has not been accepted by many journals and
        that it set too high a hurdle for the community of climate scientists to pick up the code easily
        and modify it for their use cases: uptake of the code in the community was unsatisfyingly
        low.
        After these experiences, the idea was accepted that licensing the code as open source software
        would solve many of these problems.
        The selection of the most appropriate open source license raised considerable discussion among
        the parties involved; we opted for the permissive BSD license. We drafted two documents, a software
        policy and a contributor license agreement, that would enable MPI-M to license all its
        software development, and especially "its" parts of ICON, as open source software under the
        BSD-3-Clause license.
        We hope to promote the development and use of ICON worldwide, to enable researchers who
        change institution during their career to continue to use "their" work, to take away
        from researchers the pain of thinking about how to license, and to establish a fair way of sharing IP
        rights between the MPG and the individual coder.
        In our presentation we will briefly explain the legal background and the legal meaning of
        both documents, describe the process it took to implement these documents at the institute
        (board of directors, works council (Betriebsrat), etc.) and show which IT and personnel
        infrastructure is needed for the policy to work.

        Speakers: Reinhard Budich (MPI for Meteorology), Maximilian Funk (MPI for Meteorology)
      • 24
        Good Practices for Documenting Copyright and License Information in your Software

        After you have clarified the target license of your software, you need to ensure that this information is properly documented.
        But what aspects do you need to consider and how can you achieve it in an efficient way?
        In this talk we discuss minimum practices for documenting copyright and license information for software and show practical examples.
        Particularly, we introduce REUSE Software which supports you with recommendations and tools.

        Speaker: Tobias Schlauch (DLR)
      • 25
        Data management for heterogeneous research environments with CaosDB -- Experiences from an MPDL Open Source development project

        Experimental and theoretical scientists in the turbulence department at the MPI-DS in Göttingen produce a large variety of heterogeneous data and analyze it in a number of different environments. In an MPDL project, the open source research data management software CaosDB was enhanced to meet these needs and hopefully those of other research groups as well.

        We will show the results of this process: automated integration of data from metadata-rich raw HDF5 files and a new API with language bindings for Octave, C++ and Julia. Additionally, the user documentation was overhauled, programming tutorials published and performance bottlenecks identified. We will also share insights about "soft" measures to increase the overall utility of semantic data management: practical guidelines for scientists to produce truly FAIR data and workshops to empower scientists to work with CaosDB.

        Speakers: Daniel Hornung (IndiScale), Florian Spreckelsen (IndiScale GmbH), Freija Nordsiek (MPI for Dynamics and Self-Organization)
      • 26
        Developing a software platform for a systems’ study of human history

        Pandemics, war, inequality, environmental and climatic degradation, identitarian conflicts, and the rise of extreme political movements are not isolated phenomena but rather intertwined in ways often difficult to detect. This reminds us that human societies are complex dynamical systems and themselves part of a broader human-environmental system or nested within social, economic, cultural, technological, political, and climatic systems. Gaining a better understanding of the intricate connections among such systems is a difficult but crucial task for policy makers and risk analysts facing modern-day societal problems. Yet, the useful contribution that could be made by the systematic study of past societies remains largely unexplored. To gain useful historical knowledge I am developing a systems’ approach for the comparative study of past societies. This is grounded on the Big Data initiatives Pandora & IsoMemo, which form an interdisciplinary distributive network of open-access historical and archaeological databases, and on a new open-source software platform. This platform consists of a combination of existing and novel self-developed R packages that are made available via user-friendly Shiny interfaces. Bayesian modelling of diverse types of proxy data is extensively employed for spatiotemporal reconstruction of past human activities or events and paleo-environmental conditions. Modelling features include the ability to incorporate historical uncertainties and prior expert knowledge. In addition, a variety of formal hypothesis testing approaches are being developed together with more data-driven methods (e.g., Bayesian networks) for the discovery of historical causal mechanisms. In my presentation, I will give a brief overview of the Pandora & IsoMemo initiatives, describe their software platform, and illustrate research possibilities via a selection of case studies.

        Speaker: Ricardo Fernandes (Max Planck Institute for the Science of Human History, Jena, Germany)
      • 27
        Managing data at a small-sized sequencing facility
        Speaker: Dr Ilja Bezrukov (MPI Biology)
      • 28
        StudyDB — Key Concepts

        Our StudyDB inherits a lot of functionality from previous Django development efforts. However, StudyDB faces some additional challenges as it is intended for data collection in the context of translational human studies (empirical and prospective research, clinical trials) and »electronic« questionnaires (interfacing with and processing data from our LimeSurvey server).

        We would like to present our implementation of the following key concepts which might be of a more general interest and are not limited to database applications:

        (1) Single Source of Truth: StudyDB manages about 1000 parameters in about 30 tables. We assembled JSON files for each table describing each parameter with its data type, expected range of values and a comment including the units of measurement if applicable (all files are managed in one repository on our GitHub server). We use this for JSON-Schema validation, generation of test data and Python (Django) source code, online validation of input data, defining function-type fields and automatically generated documentation.

        (2) Timestamping as described in RFC 3161 (using the “DFN Zeitstempel” service)

        (3) A generic table viewer optimised for object-level database access including KaTeX-based LaTeX rendering.

        For further details, please see "Presentation Materials" section below.

        Speaker: Stefan Vollmar (MPI for Metabolism Research)
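
        The "Single Source of Truth" idea in point (1) above can be illustrated with a minimal Python sketch in which one JSON description of a table drives validation, test data generation, and documentation. This is an assumption-laden toy, not StudyDB code: the parameter names, the JSON layout, and the functions are hypothetical, and a hand-rolled check stands in for the JSON-Schema validation mentioned in the abstract.

```python
import json
import random

# Hypothetical parameter description for one table, kept as JSON so that
# validation, test data and documentation can all be derived from it.
PARAMETERS_JSON = """
{
  "age":    {"type": "integer", "min": 18,   "max": 99,    "comment": "age in years"},
  "weight": {"type": "number",  "min": 30.0, "max": 250.0, "comment": "body weight in kg"},
  "smoker": {"type": "boolean", "comment": "self-reported smoking status"}
}
"""
PARAMETERS = json.loads(PARAMETERS_JSON)
PYTHON_TYPES = {"integer": int, "number": (int, float), "boolean": bool}


def validate(record):
    """Return a list of human-readable problems; an empty list means the record is valid."""
    problems = []
    for name, spec in PARAMETERS.items():
        if name not in record:
            problems.append(f"{name}: missing")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it for numeric fields.
        if spec["type"] != "boolean" and isinstance(value, bool):
            problems.append(f"{name}: expected {spec['type']}")
        elif not isinstance(value, PYTHON_TYPES[spec["type"]]):
            problems.append(f"{name}: expected {spec['type']}")
        elif "min" in spec and not spec["min"] <= value <= spec["max"]:
            problems.append(f"{name}: {value} outside [{spec['min']}, {spec['max']}]")
    return problems


def generate_test_record(rng):
    """Draw a random record that satisfies every constraint in PARAMETERS."""
    record = {}
    for name, spec in PARAMETERS.items():
        if spec["type"] == "integer":
            record[name] = rng.randint(spec["min"], spec["max"])
        elif spec["type"] == "number":
            record[name] = rng.uniform(spec["min"], spec["max"])
        else:
            record[name] = rng.random() < 0.5
    return record


def documentation():
    """Render one documentation line per parameter from the same description."""
    return [f"{name} ({spec['type']}): {spec['comment']}" for name, spec in PARAMETERS.items()]


sample = generate_test_record(random.Random(0))
print(validate(sample))
print(validate({"age": 17, "weight": "heavy"}))
print("\n".join(documentation()))
```

        Because every consumer reads the same description, changing a range or unit in the JSON file updates validation, generated test data, and documentation together.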
      • 29
        FAIRness of the mathematical research-data repository MathRepo

        MathRepo, located at https://mathrepo.mis.mpg.de, is an online repository for mathematical research data, in particular for code, software, and teaching material. In this talk I will discuss its current content, the role software plays in mathematical research, and future improvements of the repository regarding the FAIR principles.

        Speaker: Christiane Görgen (Max Planck Institute for Mathematics in the Sciences)
    • Dinner: Light Lunch
    • 30
      Discussion & Farewell