Run Biomarker

To identify biomarkers for a binary classification problem, users need to specify the taxonomy level and the target variable. In the Advanced Options, users can also specify the number of CV repeats, the number of CV folds, and the top biomarker proportion. For example, with 3-fold cross-validation repeated 3 times, animalcules randomly splits the dataset into 3 folds and runs CV, then repeats this procedure 3 times, each time with a different random split. The top biomarker proportion defines the threshold for selecting biomarkers: animalcules generates a model-based importance score for each microbe/feature and selects the top-scoring fraction of features as biomarkers (the top 20% with the default proportion of 0.2).
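To make the repeated cross-validation scheme concrete, here is a minimal Python sketch of how 3 folds repeated 3 times yields 9 train/test splits. It only illustrates the idea and is not the animalcules implementation; the function name `repeated_kfold` and its parameters are hypothetical.

```python
import random

def repeated_kfold(n_samples, n_folds=3, n_repeats=3, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV.

    Hypothetical helper for illustration; not part of animalcules.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random split for every repeat
        # Deal the shuffled indices into n_folds groups of near-equal size.
        folds = [indices[i::n_folds] for i in range(n_folds)]
        for fold, test_idx in enumerate(folds):
            # Every sample outside the held-out fold is used for training.
            train_idx = [i for j, f in enumerate(folds) if j != fold for i in f]
            yield repeat, fold, sorted(train_idx), sorted(test_idx)

# 30 samples, matching the size of the smaller test dataset in this guide.
splits = list(repeated_kfold(n_samples=30))
print(len(splits))  # 3 repeats x 3 folds = 9 train/test splits
```

Each sample is held out exactly once per repeat, so every repeat gives a complete but differently shuffled evaluation of the model.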

Users can also choose the binary classification model: logistic regression or random forest. After clicking the “Run” button, the biomarker list appears on the right-hand side.

Note:

  • If the dataset is too small or unbalanced, cross-validation cannot be applied. You will see an error message such as: “NA/NaN/Inf in foreign function call”.
  • The target variable cannot contain any special characters; otherwise, an error will occur.

Instructions:

  • Select taxonomy level in the menu (default: genus).
  • Select the target variable for biomarker identification.
  • (Optional) Select number of CV folds (default: 3).
  • (Optional) Select number of CV repeats (default: 3).
  • (Optional) Select top biomarker proportion based on importance score (default: 0.2, representing 20%).
  • (Optional) Select model (default: logistic regression).
  • Click the “Run” button.

Running time:

  • Test dataset with 30 samples and 427 microbes: 8.5s
  • Test dataset with 587 samples and 203 microbes: 32.4s

Importance Plot

This tab shows a ranked plot of the feature importance scores for the identified biomarkers. The higher the score, the more a feature (species, genus, etc.) contributes to the predictive power of the model.
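The ranking and the top biomarker proportion cutoff can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the animalcules code: the function `select_top_features`, the feature names, and the scores are all hypothetical, and rounding the cutoff up to a whole number of features is an assumption.

```python
import math

def select_top_features(importance, proportion=0.2):
    """Return the names of the top-scoring features, highest score first.

    Hypothetical helper; rounding up with ceil is an assumption.
    """
    # Rank features by importance score, most important first.
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    # Keep the requested fraction of features, but always at least one.
    n_keep = max(1, math.ceil(proportion * len(ranked)))
    return [name for name, _ in ranked[:n_keep]]

# Made-up importance scores for ten placeholder genera.
scores = {
    "genus_A": 12.0, "genus_B": 85.5, "genus_C": 3.1, "genus_D": 47.2,
    "genus_E": 0.4, "genus_F": 100.0, "genus_G": 9.9, "genus_H": 22.3,
    "genus_I": 1.7, "genus_J": 30.8,
}
biomarkers = select_top_features(scores, proportion=0.2)
print(biomarkers)  # the 2 highest-scoring of the 10 features
```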

CV ROC Plot

The identified biomarkers are used to retrain the model with cross-validation, and the resulting ROC plot is shown automatically in this tab.
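For readers unfamiliar with ROC plots, the following Python sketch shows how a ROC curve and its area under the curve (AUC) can be derived from held-out predictions. It is a generic illustration with made-up labels and scores; how animalcules computes and draws the curve internally is not described here, and the function names are hypothetical.

```python
def roc_curve_points(labels, scores):
    """Return (false positive rate, true positive rate) points.

    labels are 0/1 class labels; scores are predicted probabilities
    for class 1. Hypothetical helper for illustration only.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sweep the decision threshold from the highest score downward.
    pairs = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up held-out predictions pooled across CV folds.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_curve_points(labels, scores)
print(round(auc(points), 3))  # 0.889
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation of the two classes.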