mlpack_random_forest(1) | User Commands | mlpack_random_forest(1) |
NAME
mlpack_random_forest - random forests
SYNOPSIS
mlpack_random_forest [-m unknown] [-l string] [-D int] [-g double] [-n int] [-N int] [-a bool] [-s int] [-d int] [-T string] [-L string] [-t string] [-V bool] [-M unknown] [-p string] [-P string] [-h -v]
DESCRIPTION
This program is an implementation of the standard random forest classification algorithm by Leo Breiman. A random forest can be trained and saved for later use, or a random forest may be loaded and predictions or class probabilities for points may be generated.
The training set and associated labels are specified with the '--training_file (-t)' and '--labels_file (-l)' parameters, respectively. The labels should be in the range [0, num_classes - 1]. If '--labels_file (-l)' is not specified, the labels are assumed to be the last dimension of the training dataset.
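The label requirement above can be met with a small preprocessing step. The following is a minimal sketch, assuming a CSV file with one point per row and a class name in the last column; the file names 'raw.csv', 'data.csv', 'labels.csv' and 'label_map.txt' are hypothetical and not part of mlpack.

```shell
# Hypothetical input: one point per row, class name in the last column.
cat > raw.csv <<'EOF'
5.1,3.5,setosa
7.0,3.2,versicolor
6.3,3.3,virginica
4.9,3.0,setosa
EOF

# Map each distinct class name to an integer in [0, num_classes - 1],
# in order of first appearance, and split features from labels.
awk -F, '
  {
    name = $NF
    if (!(name in id)) {
      id[name] = n++
      print id[name] "," name > "label_map.txt"   # integer,class name
    }
    print id[name] > "labels.csv"

    # Rebuild the row without its last (label) column.
    line = $1
    for (i = 2; i < NF; i++) line = line "," $i
    print line > "data.csv"
  }
' raw.csv
```

The resulting 'data.csv' and 'labels.csv' can then be passed to '--training_file (-t)' and '--labels_file (-l)'. Alternatively, a file whose last column already holds integer labels can be passed to '--training_file (-t)' alone.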
When a model is trained, the '--output_model_file (-M)' output parameter may be used to save the trained model. A model may be loaded for predictions with the '--input_model_file (-m)' parameter. The '--input_model_file (-m)' parameter may not be specified when the '--training_file (-t)' parameter is specified.
The '--minimum_leaf_size (-n)' parameter specifies the minimum number of training points that must fall into each leaf node. The '--num_trees (-N)' parameter controls the number of trees in the random forest. The '--minimum_gain_split (-g)' parameter controls the minimum required gain for a decision tree node to split. Larger values will force higher-confidence splits. The '--maximum_depth (-D)' parameter specifies the maximum depth of the tree. The '--subspace_dim (-d)' parameter is used to control the number of random dimensions chosen for an individual node's split. If '--print_training_accuracy (-a)' is specified, the calculated accuracy on the training set will be printed; this also requires '--verbose (-v)'.
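As an illustration of how the training parameters described above combine, the following sketch assembles one training command. All values are hypothetical and chosen only for illustration, and the files 'data.csv' and 'labels.csv' are assumed to exist. The command is executed only if mlpack_random_forest is installed and both files are present.

```shell
# Hypothetical hyperparameter values; tune them for your own data.
set -- --training_file data.csv --labels_file labels.csv \
       --num_trees 50 --minimum_leaf_size 5 \
       --minimum_gain_split 0.001 --maximum_depth 12 \
       --subspace_dim 3 \
       --output_model_file rf_model.bin \
       --print_training_accuracy --verbose

# Show the full command line that would be run.
cmd="mlpack_random_forest $*"
echo "$cmd"

# Run it only if the program and the (assumed) input files are available.
if command -v mlpack_random_forest >/dev/null 2>&1 &&
   [ -f data.csv ] && [ -f labels.csv ]; then
  mlpack_random_forest "$@" || echo "training failed" >&2
fi
```

Note that '--input_model_file (-m)' is deliberately absent, since it may not be combined with '--training_file (-t)'.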
Test data may be specified with the '--test_file (-T)' parameter, and if performance measures are desired for that test set, labels for the test points may be specified with the '--test_labels_file (-L)' parameter. Predictions for each test point may be saved via the '--predictions_file (-p)' output parameter. Class probabilities for each prediction may be saved with the '--probabilities_file (-P)' output parameter.
For example, to train a random forest with a minimum leaf size of 20 using 10 trees on the dataset contained in 'data.csv' with labels 'labels.csv', saving the output random forest to 'rf_model.bin' and printing the training accuracy, one could call
$ mlpack_random_forest --training_file data.csv --labels_file labels.csv --minimum_leaf_size 20 --num_trees 10 --output_model_file rf_model.bin --print_training_accuracy --verbose
Then, to use that model to classify points in 'test_set.csv' and print the test accuracy given the labels 'test_labels.csv', while saving the predictions for each point to 'predictions.csv', one could call
$ mlpack_random_forest --input_model_file rf_model.bin --test_file test_set.csv --test_labels_file test_labels.csv --predictions_file predictions.csv
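Saved predictions can also be checked against known labels outside the program. The following is a minimal sketch, assuming both files hold one integer label per line, which is how CSV output is expected to look but is not stated on this page. The file names 'sample_predictions.csv' and 'sample_labels.csv' are hypothetical stand-ins for the real prediction and label files.

```shell
# Hypothetical stand-ins for a predictions file and a test labels file.
printf '0\n1\n1\n2\n' > sample_predictions.csv
printf '0\n1\n2\n2\n' > sample_labels.csv

# Pair the files line by line and report the fraction of matching labels.
# Values are compared numerically, so "1" also matches "1.0".
accuracy=$(paste -d, sample_predictions.csv sample_labels.csv |
  awk -F, '{ if ($1 + 0 == $2 + 0) correct++ }
           END { printf "%.4f", correct / NR }')
echo "test accuracy: $accuracy"
```

With the sample data, three of the four predictions match, so the reported value is 0.7500.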
OPTIONAL INPUT OPTIONS
- --help (-h) [bool]
- Default help info.
- --info [string]
- Print help on a specific option. Default value ''.
- --input_model_file (-m) [unknown]
- Pre-trained random forest to use for classification.
- --labels_file (-l) [string]
- Labels for training dataset.
- --maximum_depth (-D) [int]
- Maximum depth of the tree (0 means no limit). Default value 0.
- --minimum_gain_split (-g) [double]
- Minimum gain needed to make a split when building a tree. Default value 0.
- --minimum_leaf_size (-n) [int]
- Minimum number of points in each leaf node. Default value 1.
- --num_trees (-N) [int]
- Number of trees in the random forest. Default value 10.
- --print_training_accuracy (-a) [bool]
- If set, then the accuracy of the model on the training set will be printed (verbose must also be specified).
- --seed (-s) [int]
- Random seed. If 0, 'std::time(NULL)' is used. Default value 0.
- --subspace_dim (-d) [int]
- Dimensionality of random subspace to use for each split. '0' will autoselect the square root of data dimensionality. Default value 0.
- --test_file (-T) [string]
- Test dataset to produce predictions for.
- --test_labels_file (-L) [string]
- Test dataset labels, if accuracy calculation is desired.
- --training_file (-t) [string]
- Training dataset.
- --verbose (-v) [bool]
- Display informational messages and the full list of parameters and timers at the end of execution.
- --version (-V) [bool]
- Display the version of mlpack.
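To see what the automatic choice for '--subspace_dim (-d)' amounts to on a given dataset, the square root of the dimensionality can be estimated from the file itself. This is a sketch only: the file name 'features.csv' is hypothetical, the file is assumed to hold one point per row with no label column, and truncating the square root to an integer is an assumption, since this page does not state how the value is rounded.

```shell
# Hypothetical 9-dimensional dataset with two points (one per row).
printf '1,2,3,4,5,6,7,8,9\n9,8,7,6,5,4,3,2,1\n' > features.csv

# Count the columns of the first row and take the square root.
# Truncating to an integer is an assumption of this sketch.
subspace_dim=$(awk -F, 'NR == 1 { print int(sqrt(NF)); exit }' features.csv)
echo "expected default subspace_dim: $subspace_dim"
```

If the labels are stored as the last column of the training file, subtract one from the column count before taking the square root.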
OPTIONAL OUTPUT OPTIONS
- --output_model_file (-M) [unknown]
- Model to save trained random forest to.
- --predictions_file (-p) [string]
- Predicted classes for each point in the test set.
- --probabilities_file (-P) [string]
- Predicted class probabilities for each point in the test set.
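The probabilities output can be inspected with standard tools. The following sketch picks the most probable class for each point, assuming the CSV holds one row per test point and one column per class. That layout is an assumption and is not stated on this page. The file names 'sample_probabilities.csv' and 'sample_argmax.csv' are hypothetical.

```shell
# Hypothetical probabilities for three points and three classes
# (assumed layout: one row per point, one column per class).
cat > sample_probabilities.csv <<'EOF'
0.8,0.1,0.1
0.2,0.5,0.3
0.1,0.2,0.7
EOF

# For each row, print the zero-based index of the largest probability,
# matching the [0, num_classes - 1] label convention.
awk -F, '{
  best = 1
  for (i = 2; i <= NF; i++) if ($i + 0 > $best + 0) best = i
  print best - 1
}' sample_probabilities.csv > sample_argmax.csv
```

The resulting class indices would normally be expected to agree with the contents of the '--predictions_file (-p)' output, though this page does not guarantee it.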
ADDITIONAL INFORMATION
For further information, including relevant papers, citations, and theory, consult the documentation found at http://www.mlpack.org or included with your distribution of mlpack.
12 December 2020 | mlpack-3.4.2 |