*look into EZKL; if it's usable then it may allow for using AI in the
Grader Note: in this doc I describe the wishful-thinking scenario for this pallet, which probably differs slightly from the actual current implementation, with certain functionalities either out of scope or implemented with a simpler, more naive approach.
Thank you to Matjaz, Andrzej, Mykyta & Cisco for helping design the rules of the game 🫶!
NOTE ON FRIDAY DURING CLASS (the rest of the README doesn't reflect this idea): Shawn was talking about the idea of off-chain zk-proofs, with the example of calculating the distribution of staking across validators, where validators submit a proof of their calculation along with their score. I believe this might be a good tool for making the game cycle much simpler while reducing on-chain computation. It might also open the door to much more sophisticated calculations for finding good answers & reward distributions.
While artificial intelligence has been used pervasively for some time, only recently has its immense potential been recognized by the broader public. Because of this, there is increasing pressure to accelerate the pace of development.
Data labeling is both the bottleneck & the moat leveraged by large AI companies. We would like to build a streamlined, scalable process for arbitrary data labeling that is inherently open source.
This pallet implements the logic for a game incentivizing players to calculate & share the output of some arbitrary pseudo-deterministic function $\mathcal{F}$.
While players (label providers) may use AI themselves to produce their answers, it is assumed that (1) some human effort is required or (2) the SOTA (state-of-the-art) model for the task is not publicly available.
| Sets | |
|---|---|
| | The sample set. Arbitrarily large data |
| | The label set. Arbitrarily large data |
| | Representational feature space of $n$ dimensions |
| | The loss set. Equivalent to |
| | The reward set |
| Functions | | |
|---|---|---|
| $\mathcal{F}$ | The function for label providers to solve | off-chain |
| $\mathcal{C}$ | Compression Function | off-chain |
| $\mathcal{D}$ | Loss function conditional on some ideal | on-chain |
| | Reward Function | on-chain |
| Phase | |
|---|---|
| Start | A sample provider submits a sample given by a location (an IPFS URL) and a description of the task |
| Commit | Players join the game by committing a hash of their answer's location and providing some collateral. |
| Reveal | Players reveal the location of their answer. If they fail to submit their revealed answer during the reveal phase, they are marked as malicious and their collateral is slashed. |
| Score and Tattle (OUT OF SCOPE) | Players play another round of Commit and Reveal on a known deterministic function whose answer is given by the concatenation of: (1) the condensed representations of all submitted labels, given by $\mathcal{C}$; (2) all instances of malicious behavior by label providers, represented as a vector of booleans. Detected malicious behavior at this stage means a revealed IPFS URL that either does not exist or contains data that does not match the submitted answer. Unlike the previous phase, players commit/reveal the actual answer to this function, not an external location. This is fine because the output is reasonably small. We expect nearly all submitted answers to be exactly the same, and consider this to be the correct answer, which is used in the next phase. If a small portion of answers differ from the popular answer, we mark the players who submitted them to be slashed. If there is no clear popular answer, we abort the game and return all held balances. This phase is expected to be as automated as possible, with (1) being fully automated, and (2) requiring only minimal human intervention, or fully automated by some trusted AI (auto-detecting bad content at an IPFS URL should be trivial). |
| Resolve | A theoretic true label is approximated as the average of the condensed labels, each submission is scored by its distance to it, and the bounty is distributed by the reward function. |
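The Commit/Reveal mechanics above can be sketched as follows. This is a minimal illustration, not the pallet's actual API: `DefaultHasher` stands in for the chain's real cryptographic hash (Substrate would use something like Blake2), and the function names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for the chain's cryptographic hash; DefaultHasher is NOT
// secure and only illustrates the commit/reveal flow.
fn commitment(answer_location: &str, salt: u64) -> u64 {
    let mut h = DefaultHasher::new();
    answer_location.hash(&mut h);
    salt.hash(&mut h);
    h.finish()
}

// At reveal time the chain recomputes the hash from the revealed
// location and compares it against the stored commitment; a mismatch
// marks the player as malicious and slashes their collateral.
fn verify_reveal(stored: u64, revealed_location: &str, salt: u64) -> bool {
    commitment(revealed_location, salt) == stored
}
```

The salt prevents other players from brute-forcing a committed answer's location before the reveal phase.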
In the initial spec I made the mistake of thinking an individual actor could calculate their portion of k-means (our $\mathcal{C}$) before the reveal. Additionally, in the current implementation, there is no way to confirm the actual data inside a label provider's IPFS URL. The above table describes the more functional game; the currently implemented game cycle can be found here.
The payout distribution is given by some reward function
In our runtime we chose a geometric series, given by `bounty * 0.5^nth_place`. The intended purpose of this choice is to heavily reward the best players and incentivize highly competitive labels. (Softmax would be a very interesting selection; I think I would prefer it to a geometric series, but I'm not even gonna try to implement that with this on-chain math.)
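A sketch of this geometric schedule, assuming integer halving in place of the pallet's fixed-point math; handing the rounding dust to first place is an assumption of this sketch, not necessarily what the runtime does:

```rust
// Place i (0-indexed) receives bounty * 0.5^(i + 1) by repeated halving.
fn geometric_payouts(bounty: u128, n_players: usize) -> Vec<u128> {
    let mut payouts = Vec::with_capacity(n_players);
    let mut share = bounty;
    for _ in 0..n_players {
        share /= 2; // each place receives half of the previous one
        payouts.push(share);
    }
    let paid: u128 = payouts.iter().sum();
    if let Some(first) = payouts.first_mut() {
        *first += bounty - paid; // pay out the full bounty
    }
    payouts
}
```

For example, a bounty of 100 split across 3 players yields roughly 50/25/12 plus dust, so every unit of the bounty is distributed.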
In the future, it would be ideal to provide a variety of pre-set options for
A common labeling task in the medical field is segmentation data. Segmenting data such as blood vessels in a CT scan of a kidney is time-intensive, taking months of effort to label a single scan. Additionally, datasets are often fragmented & closed source, which leads to an environment where it is challenging to train accurate models due to noisy, small datasets.
By automating the availability of labeling efforts through market dynamics and forcing the results to be open source, we can accrue much larger, higher-quality datasets than previously available, and in turn train more accurate, open-source models.
While it may not be viable to post a sample with a 3-month long submission period, one could easily split large scans into smaller chunks, and submit these as separate samples expected to take 10-30 minutes to complete.
Because of the overhead of fees on extrinsic calls, verification games, and IPFS writes/reads on a per-sample basis, this is not well suited for extremely small samples with small rewards. The way to overcome this is batching.
EXAMPLE: If a sample is an image and the prompt is "Is there a dog in this image? Your response should be 0 for no or 1 for yes", the bounty a sample provider is willing to put up for this task is likely too small to cover the overhead costs. Instead we can say a sample is a batch of images, and the prompt is "Given a collection of images, your response should be an array of 0s and 1s, where the i'th element is given by { 1 for "dog in the i'th image" or 0 for "no dog in the i'th image" }."
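One plausible way to condense such batched binary answers into a consensus label is a per-element majority vote across submissions (equivalent to rounding the average). This is an illustrative sketch, not the pallet's actual logic:

```rust
// Each submission is a vector of 0/1 flags ("dog in image i?"); the
// consensus label takes the strict majority for each image index.
fn majority_label(answers: &[Vec<u8>]) -> Vec<u8> {
    let batch_len = answers[0].len();
    (0..batch_len)
        .map(|i| {
            let ones: usize = answers.iter().map(|a| a[i] as usize).sum();
            // a strict majority of submissions flagged image i as "dog"
            if ones * 2 > answers.len() { 1 } else { 0 }
        })
        .collect()
}
```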
AI engineers who have developed a state-of-the-art model for some in-demand task could effectively play as a label provider to "soft open source" their model through black-box knowledge distillation, while simultaneously claiming a reward for their contribution, until their model is fully distilled and no longer competitive.
We maintain a minimal on-chain state, which acts essentially as RAM for samples currently being processed. We emit events for sample providers to receive their label locations and expect them to track this information themselves. By aggressively cleaning up the state we can make this much more cost effective. A third party may aggregate this historic information to serve queries into already emitted, but still active samples.
- We assume that cooperation is always the best strategy for players, enforced by some tunable collateral that is slashed in the event of malicious behavior (this should be settable per dataset).
- The average of the submitted labels is a reasonable approximation of the ground truth, and the distance from this average is a good proxy for the quality of a label.
- The correct answer to $\mathcal{F}$ is a Schelling point.
Because we use the average of the submitted labels to measure the quality of any given label, a natural attack vector is to submit the same label many times in order to pull the average toward your own submission. Let's consider several points:
- Each label provider is required to submit a deposit that may be slashed in the event of malicious behavior.
- Something about people chain
- Requires label submitters to be actively connected & listening to events emitted during the later phases, which could be a long time later
- Potential solution: the players in `score_and_tattle` could be completely different from the label provider set, but this would require an additional reward pulled from the bounty, increasing costs for sample providers
TODO! actually specify current implementation
| Phase | |
|---|---|
| Commit | Players commit a hash of their answer's location |
| Reveal | Players reveal the location of their answer |
| Score and Tattle (OUT OF SCOPE) | Players play another, slightly modified, round of Commit and Reveal which calculates 2 things: (1) the condensed representation of each label's data; (2) all instances of foul play by label providers. TODO: explain the nuanced difference between how to interpret the correctness of answers in this game vs the previous Commit/Reveal game (exact, deterministic, commonly known, & small answer). |
| Resolve | TODO: write this section |
- We should be deleting samples/labels from storage after a sample is resolved (the data a sample provider needs is still in IPFS, and that's all they really need; there is no need to store the historic metadata of all samples/labels in state)
- Further investigation into selection for $\mathcal{C}$ and $\mathcal{D}$.
- Reporting fraud (i.e. bad IPFS links). Needs a separate chain incentivized to report fraud, or a centralized solution w/ privileged access to report fraud.
- (OOS) Python API for sample & label Providers
- (OOS) There should be a concept of a Data Set which samples are attached to. The dataset would hold the description, not the Sample. I'm leaving this out of the scope of this assignment.
- A `withdraw` call. There's no reason not to allow labellers to withdraw their submission (and collateral) any time before the Submission phase ends.
- (OOS) Sample submitters may need to also provide a held deposit for each active sample they have, to prevent an attack vector of spamming the system with many new samples at the same time. Maybe it should have a linear relationship with `submission_period`, because this determines how long the sample will live in memory on-chain? But also, their money will be locked for a longer period, so a constant-size deposit is probably fine.
- Score & Tattle is perfect to be implemented as an off-chain worker?
- We need a variance calculation in addition to the average so that we can:
  - Slash answers that are beyond a threshold relative to the mean/variance
  - Detect wash situations where variance is too high and we just return holds to everyone involved (besides providers with bad reveals; still slash them)
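The variance check above might look like the following sketch, using scalar scores for simplicity. `Outcome`, `z_threshold`, and `wash_variance` are hypothetical names and tuning knobs, not values from the pallet, and `f64` stands in for on-chain fixed-point math.

```rust
#[derive(Debug, PartialEq)]
enum Outcome {
    // Variance too high: abort the game and return everyone's holds.
    Wash,
    // Indices of answers beyond the threshold, to be slashed.
    Slash(Vec<usize>),
}

fn variance_check(scores: &[f64], z_threshold: f64, wash_variance: f64) -> Outcome {
    let n = scores.len() as f64;
    let mean = scores.iter().sum::<f64>() / n;
    let var = scores.iter().map(|s| (s - mean).powi(2)).sum::<f64>() / n;
    if var > wash_variance {
        return Outcome::Wash;
    }
    let sd = var.sqrt();
    // Flag any score more than z_threshold standard deviations from the mean.
    let outliers = scores
        .iter()
        .enumerate()
        .filter(|(_, s)| sd > 0.0 && ((**s - mean) / sd).abs() > z_threshold)
        .map(|(i, _)| i)
        .collect();
    Outcome::Slash(outliers)
}
```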

