geomar-od-lagrange/2025_Project-Idea-Biological-Trajectory-Clustering

Project Idea: Biological Trajectory Clustering

Background

Analysing connectivity due to the dispersal of biological particles (e.g. larvae, eggs, pathogens) requires physically simulating the pathways that passive particles follow as they drift with ocean currents. It also requires modelling how these particles develop under way (e.g. growth or death due to the availability or scarcity of food, or due to temperature changes along the way). While the physical mechanisms governing dispersal are generally well understood, our biological understanding is much more uncertain. Hence, we often want to test many different sets of biological parameters against the same set of physical trajectories.

As a tool for this work, we'd like to be able to provide experts in the biological mechanisms with a way to test their understanding by exploring a physical dispersal simulation. A simple example of such a product is shown below.

[Figure: Oyster Connectivity Dashboard]

Problem

We have on the order of hundreds of gigabytes of simulated particle trajectories in the North Sea region. Each trajectory consists of multiple time series describing the geographic location, $z(t)$, $y(t)$, $x(t)$, and ambient conditions such as temperature, $T(t)$, and salinity, $S(t)$.

$$\mathrm{traj}_n = \{t\in[t_{ns}, t_{ne}]: z_n(t), y_n(t), x_n(t), T_n(t), S_n(t)\}$$

Here, $t_{ns}$ and $t_{ne}$ denote the start and end times of trajectory $n$.

We split the North Sea region into hexagons $h$ of approximately 10 km radius and sort all trajectories into groups $\mathcal{T}_{h_0, h_1}$ according to the hexagon $h_0$ they start in and the hexagon $h_1$ they end in.

$$\mathcal{T}_{h_0, h_1} = \{\mathrm{traj}_n: (x_n(t_{ns}), y_n(t_{ns})) \in h_0, (x_n(t_{ne}), y_n(t_{ne})) \in h_1\}$$
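The grouping by start and end hexagon can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes positions have already been projected to planar coordinates in kilometres, uses a pointy-top hexagonal grid with a 10 km circumradius, and represents each trajectory as a dict with `"x"` and `"y"` lists. The names `hex_index` and `group_by_endpoints` and the dict layout are hypothetical.

```python
import math
from collections import defaultdict

HEX_RADIUS_KM = 10.0  # assumed circumradius, following "approximately 10 km"


def hex_index(x_km, y_km, radius=HEX_RADIUS_KM):
    """Return the axial (q, r) index of the pointy-top hexagon containing a point."""
    # Fractional axial coordinates of the point.
    q = (math.sqrt(3) / 3 * x_km - y_km / 3) / radius
    r = (2 / 3 * y_km) / radius
    s = -q - r
    # Cube rounding: round all three, then repair the coordinate with the
    # largest rounding error so that q + r + s == 0 still holds.
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return (rq, rr)


def group_by_endpoints(trajectories):
    """Map (h0, h1) to the list of trajectory indices n starting in h0 and ending in h1."""
    groups = defaultdict(list)
    for n, traj in enumerate(trajectories):
        h0 = hex_index(traj["x"][0], traj["y"][0])    # position at t_ns
        h1 = hex_index(traj["x"][-1], traj["y"][-1])  # position at t_ne
        groups[(h0, h1)].append(n)
    return dict(groups)
```

Storing trajectory indices rather than the trajectories themselves keeps the grouping cheap to hold in memory, which may matter at the data volumes described above.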

Then, for each trajectory, we apply our biological model and, for example, estimate the probability $p(\mathrm{traj}_n)$ that a biological particle following the trajectory actually survives the journey and could settle in the final location:

$$p(\mathrm{traj}_n) = \mathrm{biology}(z_n(t), y_n(t), x_n(t), T_n(t), S_n(t))$$
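To make the role of $\mathrm{biology}$ concrete, here is a deliberately simple toy model. It is purely an assumption for illustration and not one of the biological models this project targets: survival is the exponential of the negative time-integrated mortality rate, with a constant base rate and an extra stress rate whenever the temperature leaves a tolerance window. All parameter names and default values are hypothetical.

```python
import math


def survival_probability(t_days, temperature_c, base_rate=0.05,
                         stress_rate=0.5, t_min=10.0, t_max=20.0):
    """Toy survival model: p = exp(-integral of the mortality rate along the trajectory).

    Rates are per day; the temperature of each time step is taken as the
    mean of its two end points.
    """
    hazard = 0.0
    for i in range(len(t_days) - 1):
        dt = t_days[i + 1] - t_days[i]
        temp = 0.5 * (temperature_c[i] + temperature_c[i + 1])
        rate = base_rate
        if temp < t_min or temp > t_max:
            rate += stress_rate  # extra mortality outside the tolerance window
        hazard += rate * dt
    return math.exp(-hazard)
```

A model of this kind only reads $T_n(t)$ and the time axis, which hints at why many trajectories may be indistinguishable from the model's point of view.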

Finally, we calculate a connection probability between $h_0$ and $h_1$ by averaging the survival probabilities of all trajectories connecting $h_0$ and $h_1$:

$$p_{h_0, h_1} = \frac{\sum_{\mathrm{traj}_n \in \mathcal{T}_{h_0, h_1}} p(\mathrm{traj}_n)}{\sum_{\mathrm{traj}_n \in \mathcal{T}_{h_0, h_1}} 1}$$
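The averaging step can be sketched in a few lines. The sketch assumes a mapping from each hexagon pair to the indices of the trajectories connecting that pair, plus a sequence of per-trajectory survival probabilities; both data structures and the function name `connection_probabilities` are hypothetical.

```python
from statistics import fmean


def connection_probabilities(groups, survival):
    """Average the survival probabilities of all trajectories in each (h0, h1) group.

    groups:   dict mapping (h0, h1) to a list of trajectory indices n
    survival: sequence with survival[n] = p(traj_n)
    """
    return {
        pair: fmean([survival[n] for n in members])
        for pair, members in groups.items()
    }
```

Note that every new biological model changes `survival` for all trajectories, so this direct approach has to touch the full data set each time.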

However, handling the raw trajectory data each time a new biological model needs to be tested means repeatedly processing hundreds of gigabytes of data. On the other hand, we can assume that the number of degrees of freedom in $\mathcal{T}_{h_0, h_1}$ is much smaller than its size. In other words, the number of trajectories that differ significantly from the perspective of a given class of biological models is much smaller than the total number of trajectories connecting $h_0$ and $h_1$.

Possible solutions

Identifying the relevant degrees of freedom in $\mathcal{T}_{h_0, h_1}$ can be understood as an unsupervised-learning task. If we can identify clusters $C$ of trajectories which lead to the same (or sufficiently similar) survival probabilities under a given class of biological models, we can reduce the effort for estimating $p_{h_0, h_1}$ to

$$p_{h_0, h_1} \approx \frac{\sum_{C} w_C p(\mathrm{traj}_C)}{\sum_C w_C}$$

where $w_C$ is the size or weight of the cluster $C$ and $\mathrm{traj}_C$ is a typical trajectory from the cluster $C$.
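The weighted estimate can be illustrated with a small sketch. As a stand-in for a proper clustering algorithm, it simply bins trajectories by coarse summary features (mean temperature and mean salinity), takes the cluster size as $w_C$, and picks the member closest to the cluster's mean feature vector as $\mathrm{traj}_C$. The feature choice, the bin widths, the dict layout with `"T"` and `"S"` lists, and all function names are assumptions made for illustration only.

```python
from collections import defaultdict


def summarize(traj):
    """Hypothetical feature vector: (mean temperature, mean salinity)."""
    return (sum(traj["T"]) / len(traj["T"]), sum(traj["S"]) / len(traj["S"]))


def cluster_by_binning(trajectories, bin_widths=(1.0, 1.0)):
    """Group trajectory indices whose feature vectors fall into the same bin."""
    clusters = defaultdict(list)
    for n, traj in enumerate(trajectories):
        key = tuple(int(f // w) for f, w in zip(summarize(traj), bin_widths))
        clusters[key].append(n)
    return list(clusters.values())


def representative(members, trajectories):
    """Return the index of the member closest to the cluster's mean feature vector."""
    feats = [summarize(trajectories[n]) for n in members]
    mean = [sum(col) / len(col) for col in zip(*feats)]

    def dist2(f):
        return sum((a - b) ** 2 for a, b in zip(f, mean))

    best = min(range(len(members)), key=lambda i: dist2(feats[i]))
    return members[best]


def clustered_connection_probability(trajectories, biology, bin_widths=(1.0, 1.0)):
    """Approximate p_{h0,h1} as sum_C w_C * p(traj_C) / sum_C w_C."""
    num = den = 0.0
    for members in cluster_by_binning(trajectories, bin_widths):
        w = len(members)  # cluster size used as the weight w_C
        rep = trajectories[representative(members, trajectories)]
        num += w * biology(rep)
        den += w
    return num / den
```

With this structure, the biological model is evaluated once per cluster instead of once per trajectory, so the cost of testing a new model scales with the number of clusters. How well the approximation holds depends on whether the chosen features capture what the class of biological models is sensitive to.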

Data

TBD
