*look into EZKL; if it's usable then it may allow for using AI in the
Grader Note: in this doc I describe the wishful-thinking scenario for this pallet, which probably differs slightly from the actual current implementation, with certain functionalities either out of scope or implemented with a simpler, more naive approach.
Thank you to Matjaz, Andrzej, Mykyta & Cisco for helping design the rules of the game 🫶!
NOTE ON FRIDAY DURING CLASS (the rest of the README doesn't reflect this idea): Shawn was talking about the idea of off-chain zk-proofs, with the example of calculating the distribution of staking across validators, where validators submit a proof of their calculation along with their score. I believe this might be a good tool for making the game cycle much simpler while reducing on-chain computation. It might also open the door to much more sophisticated calculations for finding good answers & reward distributions.
While artificial intelligence has been used pervasively for some time, only recently has its immense potential been recognized by the broader public. Because of this, there is increasing pressure to accelerate the pace of development.
Data labeling is both the bottleneck & the moat leveraged by large AI companies. We would like to build a streamlined, scalable process for arbitrary data labeling that is inherently open source.
This pallet implements the logic for a game incentivizing players to calculate & share the output of some arbitrary pseudo-deterministic function $\mathcal{F}$.
While players (label providers) may use AI themselves to produce their answers, it is assumed that (1) some human effort is required or (2) the SOTA (state-of-the-art) model for the task is not publicly available.
| Sets | |
|---|---|
| | The sample set. Arbitrarily large data |
| | The label set. Arbitrarily large data |
| | Representational feature space of $n$ dimensions |
| | The loss set. Equivalent to |
| | The reward set |
| Functions | | |
|---|---|---|
| $\mathcal{F}$ | The function for label providers to solve | off-chain |
| $\mathcal{C}$ | Compression Function | off-chain |
| $\mathcal{D}$ | Loss function conditional on some ideal | on-chain |
| | Reward Function | on-chain |
| Phase | |
|---|---|
| Start | A sample provider submits a sample given by a location (an IPFS URL) and a description of the task |
| Commit | Players join the game by committing a hash of their answer's location and providing some collateral. |
| Reveal | Players reveal the location of their answer. If they fail to submit their revealed answer during the reveal phase, they are marked as malicious and their collateral is slashed. |
| Score and Tattle (OUT OF SCOPE) | Players play another round of Commit and Reveal on a known deterministic function whose answer is given by the concatenation of: (1) the condensed representations of all submitted labels, given by $\mathcal{C}$; (2) all instances of malicious behavior by label providers, represented as a vector of booleans. Detected malicious behavior at this stage means a revealed IPFS URL that either does not exist or contains data that does not match the submitted answer. Unlike the previous phase, players commit/reveal the actual answer to this function, not an external location. This is fine because the output is reasonably small. We expect nearly all submitted answers to be exactly the same, and consider this to be the correct answer, which is used in the next phase. If a small portion of answers differ from the popular answer, we mark the players who submitted them to be slashed. If there is no clear popular answer, we abort the game and return all held balances. This phase is expected to be as automated as possible, with (1) being fully automated, and (2) requiring only minimal human intervention, or fully automated by some trusted AI (auto-detecting bad content at an IPFS URL should be trivial). |
| Resolve | A theoretic true label is approximated as the average of the condensed labels, each submission is scored by its distance to it, and the bounty is distributed by the reward function. |
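The Commit/Reveal mechanics above can be sketched as follows. This is a minimal illustration, not the pallet's actual API: `DefaultHasher` stands in for the chain's real cryptographic hash (Substrate would use something like Blake2), and the function names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for the chain's cryptographic hash; DefaultHasher is NOT
// secure and only illustrates the commit/reveal flow.
fn commitment(answer_location: &str, salt: u64) -> u64 {
    let mut h = DefaultHasher::new();
    answer_location.hash(&mut h);
    salt.hash(&mut h);
    h.finish()
}

// At reveal time the chain recomputes the hash from the revealed
// location and compares it against the stored commitment; a mismatch
// marks the player as malicious and slashes their collateral.
fn verify_reveal(stored: u64, revealed_location: &str, salt: u64) -> bool {
    commitment(revealed_location, salt) == stored
}
```

The salt prevents other players from brute-forcing a committed answer's location before the reveal phase.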
In the initial spec I made the mistake of thinking an individual actor could calculate their portion of k-means (our $\mathcal{C}$) before the reveal. Additionally, in the current implementation, there is no way to confirm the actual data inside a label provider's IPFS URL. The above table describes the more functional game; the currently implemented game cycle can be found here.
The payout distribution is given by some reward function
In our runtime we chose a geometric series, given by `bounty * 0.5^nth_place`. The intended purpose of this choice is to heavily reward the best players and incentivize highly competitive labels. (Softmax would be a very interesting selection; I think I would prefer it to a geometric series, but I'm not even gonna try to implement that with this on-chain math.)
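A sketch of this geometric schedule, assuming integer halving in place of the pallet's fixed-point math; handing the rounding dust to first place is an assumption of this sketch, not necessarily what the runtime does:

```rust
// Place i (0-indexed) receives bounty * 0.5^(i + 1) by repeated halving.
fn geometric_payouts(bounty: u128, n_players: usize) -> Vec<u128> {
    let mut payouts = Vec::with_capacity(n_players);
    let mut share = bounty;
    for _ in 0..n_players {
        share /= 2; // each place receives half of the previous one
        payouts.push(share);
    }
    let paid: u128 = payouts.iter().sum();
    if let Some(first) = payouts.first_mut() {
        *first += bounty - paid; // pay out the full bounty
    }
    payouts
}
```

For example, a bounty of 100 split across 3 players yields roughly 50/25/12 plus dust, so every unit of the bounty is distributed.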
In the future, it would be ideal to provide a variety of pre-set options for
A common labeling task in the medical field is segmentation data. Segmenting data such as blood vessels in a CT scan of a kidney is time-intensive, taking months of effort to label a single scan. Additionally, datasets are often fragmented & closed source, which leads to an environment where it is challenging to train accurate models due to noisy, small datasets.
By automating the availability of labeling efforts through market dynamics and forcing the results to be open source, we can accrue much larger, higher-quality datasets than previously available, and in turn train more accurate, open-source models.
While it may not be viable to post a sample with a 3-month long submission period, one could easily split large scans into smaller chunks, and submit these as separate samples expected to take 10-30 minutes to complete.
Because of the overhead of fees on extrinsic calls, verification games, and IPFS writes/reads on a per-sample basis, this is not well suited for extremely small samples with small rewards. The way to overcome this is batching.
EXAMPLE: If a sample is an image and the prompt is "Is there a dog in this image? Your response should be 0 for no or 1 for yes", the bounty a sample provider is willing to put up for this task is likely too small to cover the overhead costs. Instead we can say a sample is a batch of images, and the prompt is "Given a collection of images, your response should be an array of 0s and 1s, where the i'th element is given by { 1 for "dog in the i'th image" or 0 for "no dog in the i'th image" }."
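One plausible way to condense such batched binary answers into a consensus label is a per-element majority vote across submissions (equivalent to rounding the average). This is an illustrative sketch, not the pallet's actual logic:

```rust
// Each submission is a vector of 0/1 flags ("dog in image i?"); the
// consensus label takes the strict majority for each image index.
fn majority_label(answers: &[Vec<u8>]) -> Vec<u8> {
    let batch_len = answers[0].len();
    (0..batch_len)
        .map(|i| {
            let ones: usize = answers.iter().map(|a| a[i] as usize).sum();
            // a strict majority of submissions flagged image i as "dog"
            if ones * 2 > answers.len() { 1 } else { 0 }
        })
        .collect()
}
```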
AI engineers who have developed a state-of-the-art model for some in-demand task could effectively play as a label provider to "soft open source" their model through black-box knowledge distillation, while simultaneously claiming a reward for their contribution, until their model is fully distilled and no longer competitive.
We maintain a minimal on-chain state, which acts essentially as RAM for samples currently being processed. We emit events for sample providers to receive their label locations and expect them to track this information themselves. By aggressively cleaning up the state we can make this much more cost effective. A third party may aggregate this historic information to serve queries into already emitted, but still active samples.
- We assume that cooperation is always the best strategy for players, enforced by some tunable collateral that is slashed in the event of malicious behavior (this should be settable per dataset).
- The average of the submitted labels is a reasonable approximation of the ground truth, and the distance from this average is a good proxy for the quality of a label.
- The correct answer to $\mathcal{F}$ is a Schelling point.
Because we use the average of the submitted labels to measure the quality of any given label, a natural attack vector is to submit the same label many times in order to pull the average toward your own submission. Let's consider several points:
- Each label provider is required to submit a deposit that may be slashed in the event of malicious behavior.
- Something about people chain
- Requires label submitters to be actively connected & listening to events emitted during the later phases, which could be a long time later
- Potential solution: the players in `score_and_tattle` could be completely different from the label provider set, but this would require an additional reward pulled from the bounty, increasing costs for sample providers
TODO! actually specify current implementation
| Phase | |
|---|---|
| Commit | Players commit a hash of their answer's location |
| Reveal | Players reveal the location of their answer |
| Score and Tattle (OUT OF SCOPE) | Players play another, slightly modified, round of Commit and Reveal which calculates 2 things: (1) the condensed representation of each label's data; (2) all instances of foul play by label providers. TODO: explain the nuanced difference between how to interpret the correctness of answers in this game vs the previous Commit/Reveal game (exact, deterministic, commonly known, & small answer). |
| Resolve | TODO: write this section |
- We should be deleting samples/labels from storage after a sample is resolved (the data a sample provider needs is still in IPFS, and that's all they really need; there is no need to store the historic metadata of all samples/labels in state)
- Further investigation into selection for $\mathcal{C}$ and $\mathcal{D}$.
- Reporting fraud (i.e. bad IPFS links). Needs a separate chain incentivized to report fraud, or a centralized solution w/ privileged access to report fraud.
- (OOS) Python API for sample & label Providers
- (OOS) There should be a concept of a Data Set which samples are attached to. The dataset would hold the description, not the Sample. I'm leaving this out of the scope of this assignment.
- A `withdraw` call. There's no reason not to allow labellers to withdraw their submission (and collateral) any time before the Submission phase ends.
- (OOS) Sample submitters may need to also provide a held deposit for each active sample they have, to prevent an attack vector of spamming the system with many new samples at the same time. Maybe it should have a linear relationship with `submission_period`, because this determines how long the sample will live in memory on-chain? But also, their money will be locked for a longer period, so a constant-size deposit is probably fine.
- Score & Tattle is perfect to be implemented as an off-chain worker?
- We need a variance calculation in addition to the average so that we can:
  - Slash answers that are beyond a threshold relative to the mean/variance
  - Detect wash situations where variance is too high and we just return holds to everyone involved (besides providers with bad reveals; still slash them)
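The variance check above might look like the following sketch, using scalar scores for simplicity. `Outcome`, `z_threshold`, and `wash_variance` are hypothetical names and tuning knobs, not values from the pallet, and `f64` stands in for on-chain fixed-point math.

```rust
#[derive(Debug, PartialEq)]
enum Outcome {
    // Variance too high: abort the game and return everyone's holds.
    Wash,
    // Indices of answers beyond the threshold, to be slashed.
    Slash(Vec<usize>),
}

fn variance_check(scores: &[f64], z_threshold: f64, wash_variance: f64) -> Outcome {
    let n = scores.len() as f64;
    let mean = scores.iter().sum::<f64>() / n;
    let var = scores.iter().map(|s| (s - mean).powi(2)).sum::<f64>() / n;
    if var > wash_variance {
        return Outcome::Wash;
    }
    let sd = var.sqrt();
    // Flag any score more than z_threshold standard deviations from the mean.
    let outliers = scores
        .iter()
        .enumerate()
        .filter(|(_, s)| sd > 0.0 && ((**s - mean) / sd).abs() > z_threshold)
        .map(|(i, _)| i)
        .collect();
    Outcome::Slash(outliers)
}
```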

