5 Collecting Training Data

Image Collection and Considerations

Author: Martin Weigert

Affiliation: École Polytechnique Fédérale de Lausanne

Add any co-authors as appropriate

Under your first header, include a brief introduction to your chapter.

Starting prompt for this chapter: Chapter 5 discusses collecting, annotating, and validating training data. It should highlight potential pitfalls such as imbalanced data sets and out-of-distribution problems. It should also address the question: how do you collect training data on your microscope? For example, this chapter should discuss collecting low/high laser power image pairs for training an image restoration model.

5.1 The importance of good training data

Why are training data quality and diversity the main drivers of success in AI applications?

High-quality, diverse data often matter more than model architecture (“you are what you eat”).
More data helps, but only if it is well curated and representative.
Poor data harms performance silently; robust models require training data that capture the variability seen in practice.

5.2 Data types, formats, and labeling strategies

What are the typical imaging tasks and modalities, and how can the corresponding ground truth labels be generated or obtained?

Common tasks include classification, restoration, and segmentation; data are 2D/3D+t, multi-channel, 8/16-bit stacks.
Ground truth can come from manual annotation, dual acquisition (e.g., high/low laser power), or informative channels.
Synthetic data and weak supervision are helpful but often limited by oversimplified assumptions.
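
For dual acquisition, each low-power image needs to be matched to its high-power partner before training. The following is a minimal sketch of such a pairing step. The `_low` / `_high` filename suffixes and the function name `pair_low_high` are hypothetical conventions chosen for illustration, not something prescribed by this chapter.

```python
from pathlib import Path


def pair_low_high(filenames, low_tag="_low", high_tag="_high"):
    """Match low- and high-power acquisitions of the same field of view.

    Assumes a hypothetical naming scheme such as fov01_low.tif / fov01_high.tif.
    Returns (pairs, unpaired): pairs is a list of (low, high) names sorted by
    field of view; unpaired lists tagged files that lack a partner.
    Files carrying neither tag are ignored.
    """
    low, high = {}, {}
    for name in filenames:
        stem = Path(name).stem
        if stem.endswith(low_tag):
            low[stem[: -len(low_tag)]] = name
        elif stem.endswith(high_tag):
            high[stem[: -len(high_tag)]] = name

    shared = sorted(low.keys() & high.keys())
    pairs = [(low[key], high[key]) for key in shared]
    unpaired = sorted(
        [low[key] for key in low.keys() - high.keys()]
        + [high[key] for key in high.keys() - low.keys()]
    )
    return pairs, unpaired


files = [
    "fov01_low.tif",
    "fov01_high.tif",
    "fov02_low.tif",  # no high-power partner was acquired
    "fov03_high.tif",
    "fov03_low.tif",
]
pairs, unpaired = pair_low_high(files)
print("pairs:", pairs)
print("unpaired:", unpaired)
```

Reporting unpaired files explicitly, rather than dropping them silently, makes incomplete acquisitions visible before they reach the training set.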

5.3 Collecting and curating data from the microscope

What practical guidance applies to structured image acquisition and to quality control before data are used for training?

Plan acquisitions to cover biological variability under consistent settings and with metadata retention.
Use automation (e.g., μManager) and open formats for systematic, reproducible collection.
Visual and automated checks (outliers, file consistency, label alignment) are critical to ensure usable datasets.
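
The automated checks mentioned above can be sketched with NumPy alone. This example assumes the images and labels are already loaded as arrays and that labels are pixel-wise (same shape as the image), as in segmentation. The function name `check_dataset`, the robust z-score on mean intensity, and the threshold of 3.5 are illustrative choices, not a fixed recipe.

```python
import numpy as np


def check_dataset(images, labels, z_thresh=3.5):
    """Run basic consistency checks on paired image/label arrays.

    Returns a list of (index, message) tuples describing the problems found.
    """
    problems = []
    ref_dtype = images[0].dtype
    for i, (img, lab) in enumerate(zip(images, labels)):
        if img.dtype != ref_dtype:
            problems.append((i, f"dtype {img.dtype} differs from {ref_dtype}"))
        if img.shape != lab.shape:
            problems.append((i, f"image shape {img.shape} != label shape {lab.shape}"))
        if not np.isfinite(img).all():
            problems.append((i, "non-finite pixel values"))

    # Flag intensity outliers with a robust z-score (median and MAD)
    # computed on the mean intensity of each image.
    means = np.array([float(np.mean(img)) for img in images])
    med = np.median(means)
    mad = np.median(np.abs(means - med))
    if mad > 0:
        robust_z = 0.6745 * (means - med) / mad
        for i in np.flatnonzero(np.abs(robust_z) > z_thresh):
            problems.append((int(i), f"mean intensity {means[i]:.1f} is an outlier"))
    return problems


# Synthetic stand-in data: eight noisy frames with empty label masks.
rng = np.random.default_rng(0)
images = [rng.normal(100, 5, (32, 32)).astype(np.float32) for _ in range(8)]
labels = [np.zeros((32, 32), dtype=np.uint8) for _ in range(8)]
images[3] = images[3] * 10  # simulated overexposed frame
labels[5] = np.zeros((16, 16), dtype=np.uint8)  # simulated mismatched label

problems = check_dataset(images, labels)
for index, message in problems:
    print(index, message)
```

Checks like these complement, but do not replace, visual inspection of a random sample of image/label pairs.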

5.4 Splitting data and building robust validation

How should a dataset be split, and how can meaningful model evaluation be ensured?

Use train/validation/test splits; only the held-out test set should be used to report final model performance.
Validation must reflect the intended use case, ideally drawing on different experiments, microscopes, or conditions.
Avoid data leakage (e.g., adjacent slices in train/test) to ensure unbiased evaluation.
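
One way to avoid leakage between adjacent slices is to split by group (for example, by stack or by experiment) rather than by individual image. The sketch below assumes each sample carries a group ID. The function name `grouped_split`, the 70/15/15 fractions, and the `stackNN` IDs are illustrative assumptions.

```python
import numpy as np


def grouped_split(groups, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign whole groups (e.g., stacks or experiments) to train/val/test.

    `groups` holds one group ID per sample. Returns a dict mapping the split
    name to an array of sample indices. All samples of a group end up in the
    same split, so adjacent slices of one stack never straddle two splits.
    """
    groups = np.asarray(groups)
    unique = np.unique(groups)
    n = len(unique)
    if n < 3:
        raise ValueError("need at least three groups for train/val/test")

    rng = np.random.default_rng(seed)
    rng.shuffle(unique)

    # Round the requested fractions, then clamp so every split gets a group.
    n_train = min(max(int(round(fractions[0] * n)), 1), n - 2)
    n_val = min(max(int(round(fractions[1] * n)), 1), n - n_train - 1)

    split_ids = {
        "train": unique[:n_train],
        "val": unique[n_train:n_train + n_val],
        "test": unique[n_train + n_val:],
    }
    return {name: np.flatnonzero(np.isin(groups, ids)) for name, ids in split_ids.items()}


# Ten stacks with twenty slices each; slices of one stack share a group ID.
groups = [f"stack{i:02d}" for i in range(10) for _ in range(20)]
splits = grouped_split(groups)
for name, indices in splits.items():
    print(name, len(indices), "samples")
```

Splitting at the level of experiments or microscopes, when such metadata is available, follows the same pattern with a different group ID.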

5.5 Dealing with imbalance and data augmentation

What are the common practical challenges, and which techniques improve model generalization?

Class imbalance can skew learning; oversample minority classes or use weighting strategies.
Out-of-distribution issues arise with domain shifts; mitigate them with diverse training data and testing across domains.
Augmentation helps simulate variation or artifacts, but must respect biological plausibility.
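
Two of the techniques above can be sketched in a few lines of NumPy. The first function computes inverse-frequency class weights using the common "balanced" heuristic. The second applies the same random flip and 90-degree rotation to an image and its mask. Both function names are hypothetical, and flips and rotations are only plausible when the specimen has no preferred orientation, which is an assumption to verify for each dataset.

```python
import numpy as np


def inverse_frequency_weights(labels, n_classes):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = np.bincount(labels, minlength=n_classes)
    if (counts == 0).any():
        raise ValueError("every class must appear at least once")
    return len(labels) / (n_classes * counts)


def augment_pair(image, mask, rng):
    """Apply one random rotation (multiple of 90 degrees) and an optional
    horizontal flip to both image and mask, so that labels stay aligned."""
    k = int(rng.integers(4))
    image, mask = np.rot90(image, k), np.rot90(mask, k)
    if rng.random() < 0.5:
        image, mask = np.flip(image, axis=1), np.flip(mask, axis=1)
    return image, mask


# Imbalanced toy labels: 90 samples of class 0, 10 samples of class 1.
labels = np.array([0] * 90 + [1] * 10)
weights = inverse_frequency_weights(labels, n_classes=2)
print("class weights:", weights)

# Toy image with a mask derived from it by thresholding.
rng = np.random.default_rng(0)
image = np.arange(16).reshape(4, 4)
mask = image > 7
aug_image, aug_mask = augment_pair(image, mask, rng)
print(aug_image)
print(aug_mask.astype(int))
```

Because the transform is applied jointly, thresholding the augmented image reproduces the augmented mask, which is a quick sanity check that image and label remain aligned.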

5.6 Include section headers as appropriate

Use markdown heading level two for section headers. You can use standard markdown formatting, for example emphasize the end of this sentence.

This is a new paragraph with more text. Your paragraphs can cross reference other items, such as Figure 11.1. Use fig to reference figures, and eq to reference equations, such as Equation 11.1.

5.6.1 Sub-subsection headers are also available

To make your sections cross reference-able throughout the book, include a section reference, as shown in the header for Section 11.4.

5.7 Bibliography and Citations

To cite a research article, add it to references.bib and then refer to the citation key. For example, reference¹ refers to CellPose and reference² refers to ZeroCostDL4Mic.

5.8 Adding to the Glossary

We are using R code to create a glossary for this book. To add a definition, edit the glossary.yml file. To reference the glossary, enclose the word as in these examples: LLMs suffer from hallucinations. It is important to understand the underlying training data, validation data and false positives to interpret your results. Clicking on the word will reveal its definition by taking you to the entry on the Glossary page. Pressing back in your browser will return you to your previous place in the textbook.

5.9 Code and Equations

This is an example of including a Python snippet that generates a figure.

import matplotlib.pyplot as plt
plt.plot([1,23,2,4])
plt.show()

In some cases, you may want to include a code-block that is not executed when the book is compiled. Use the eval: false option for this.

import matplotlib.pyplot as plt
plt.plot([1,23,2,4])
plt.show()

Figures can also be generated with their code collapsed by default, using the code-fold: true option.

import numpy as np
import matplotlib.pyplot as plt

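# Radius sweeps from 0 to 2; the angle completes one turn per unit of radius
r = np.arange(0, 2, 0.01)
theta = 2 * np.pi * r

# Draw the spiral on a polar axis
fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(theta, r)
ax.set_rticks([0.5, 1, 1.5, 2])  # fewer radial ticks for readability
ax.grid(True)
plt.show()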

Figure 5.2: A spiral on a polar axis. (Alt text: A line plot on a polar axis. The line spirals out from a value of zero to a value of 2.)

Here is an example equation.

\[ s = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_i - \overline{x})^2} \tag{5.1}\]
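
As a quick numerical check of Equation 5.1, the sample standard deviation can be computed term by term and compared against NumPy's built-in, where `ddof=1` selects the N-1 denominator. The data values below are arbitrary and chosen only for illustration.

```python
import numpy as np

# Arbitrary example data with mean 5
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = x.size

# Direct transcription of the formula: sqrt( sum (x_i - mean)^2 / (N - 1) )
s_manual = np.sqrt(np.sum((x - x.mean()) ** 2) / (n - 1))

# NumPy equivalent; ddof=1 gives the N-1 (sample) denominator
s_numpy = np.std(x, ddof=1)

print(s_manual, s_numpy)
```

Note that `np.std` defaults to `ddof=0`, which divides by N and gives the population rather than the sample standard deviation.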

5.9.1 Embedding Figures

You can also embed figures from other notebooks in the repo as shown in the following embed example.

Figure 5.3: Polar plot of circles of random areas at random coords. (Alt text: a demo figure to be embedded in a template chapter.)

When embedding notebooks, please store the .ipynb file in the notebook directory and include the chapter in the file name, for example chapter4_example_u-net.ipynb. This is how we will handle chapter- or example-specific environments. We will host notebooks on Google Colab, so any packages required by the code (but not by the rendering of the book at large) will be installed there. That way, we will not need to maintain a global environment across the book.

5.10 Additional Quarto features

You can learn more about markdown options and additional Quarto features in the Quarto documentation. One example that you might find interesting is the option to include callouts in your text. These callouts can be used to highlight potential pitfalls or provide additional optional exercises that the reader might find helpful. Below are examples of the types of callouts available in Quarto.

Note

There are five types of callouts: note, tip, warning, caution, and important. They can default to open (like this example) or collapsed (example below).

Tip

These could be good for extra material or exercises.

Caution

There are caveats when applying these tools. Expand the code below to learn more.

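import numpy as np

# Radius values from 0 to 2 and the matching angles (one turn per unit of radius)
r = np.arange(0, 2, 0.01)
theta = 2 * np.pi * r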

Warning

Be careful to avoid hallucinations.

Important

This is key information.