Run Biomarker
To identify biomarkers for a specific binary classification problem,
users need to specify the taxonomy level and target variable. In the
Advanced Options, users can also specify the number of CV repeats,
number of CV folds, and top biomarker proportion. For example, with a
3-repeats 3-fold cross validation, animalcules will randomly split the
dataset into 3 fold and run CV, then this procedure is repeated 3 times
(each time will have a different random data split). The top biomarker
proportion defines the threshold for selecting biomarkers: animalcules
will generate a classification model based importance score for each
microbe/feature and will choose the top 20% (based on the selected
proportion which is 0.2 as default) features as the biomarkers.
Users can also choose binary classification models including logistic
regression and random forest. After clicking the button “Run”, the
biomarker list will show up at the right-hand side.
Note:
- If the dataset is too small or unbalanced, cross-validation cannot
be applied. You will see an error messages like: “NA/NaN/Inf in foreign
function call”.
- The target variable cannot contain any special characters, otherwise
there will be an error.
Instructions:
- Select taxonomy level in the menu (default: genus).
- Select the target variable for biomarker identification.
- (Optional) Select number of CV folds (default: 3).
- (Optional) Select number of CV repeats (default: 3).
- (Optional) Select top biomarker proportion based on importance score
(default: 0.2, representing 20%).
- (Optional) Select model (default: logistic regression).
- Click the button “Run”
Running time:
- Test dataset with 30 samples and 427 microbes: 8.5s
- Test dataset with 587 samples and 203 microbes: 32.4s
Importance Plot
Ranked feature importance score plot for the identified biomarkers is
showed here. The higher the score, the more important this feature
(species, genus, ..) in regard to the prediction power.
CV ROC Plot
The identified biomarkers were used to re-train the model via a
cross-validation and a ROC plot is showed automatically in this tab.