Query By Committee and Challenges in Batch Learning
Active learning, also known as optimal experimental design or query learning, is a subfield of machine learning and, more generally, of artificial intelligence. The concept of query learning was introduced in 1988 by D. Angluin in [Ang88], and active learning has since been the subject of numerous studies, as shown in the literature survey [Set09]. The key idea of active learning is that a machine learning algorithm can achieve greater accuracy when it is allowed to choose the data from which it learns. An active learning system samples instances from a set of unlabeled instances and uses a query framework to decide whether to query an oracle (a human annotator or a device which returns the true label of an instance) to obtain the label of the instance, or to discard it.
Query By Committee (QBC) is a query framework for active learning. A query framework can be described as a formal description of the methods a machine adopts to sample the instances from which it learns. QBC was proposed in 1992 by H. S. Seung, M. Opper and H. Sompolinsky in [SOS92]. QBC is a very compact algorithm: it comprises seven distinct operations and uses the informativeness of an instance as the criterion for querying the oracle. The informativeness of an instance is the mutual information [Kol56] I(X; Y) between two random variables X, Y with joint density f(x, y), which is related to the relative entropy, or Kullback-Leibler distance, D(f||g) of two probability density functions f and g by the identity
I(X; Y ) = D(f(x, y)||f(x)f(y))
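As a minimal numerical illustration, the identity can be checked directly for two discrete binary variables; the joint distribution below is an assumed example, not taken from the cited works:

```python
from math import log

# Joint distribution f(x, y) of two binary random variables (assumed values).
f_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

# Marginal distributions f(x) and f(y).
f_x = {x: sum(p for (xx, _), p in f_xy.items() if xx == x) for x in (0, 1)}
f_y = {y: sum(p for (_, yy), p in f_xy.items() if yy == y) for y in (0, 1)}

# Mutual information I(X; Y) computed as the KL divergence D(f(x,y) || f(x)f(y)).
mi = sum(p * log(p / (f_x[x] * f_y[y])) for (x, y), p in f_xy.items())
print(round(mi, 4))  # → 0.0863, positive because X and Y are dependent
```

The same value is obtained from the entropy decomposition I(X; Y) = H(X) + H(Y) - H(X, Y), which is how the equivalence with relative entropy is usually verified in the discrete case.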
QBC generalized algorithm
The theoretical foundations of QBC and the original algorithm are presented in detail in two related papers, [SOS92] and [FSST97].
The main operations carried out by the algorithm are:
1. Set up. One labeled instance, or a small number of labeled instances, is acquired to form the initial training set. The initial number of instances depends on the family F of learning algorithms.
2. Training. A set of classifiers is learned from the initial training set using learning algorithms from the family F. The set of classifiers produced by the learning algorithms forms the base of the committee. The set of learning functions of the classifiers is consistent with the training set and can therefore be interpreted as the Version Space [Mit82] of the committee. Each member of the committee, that is, each classifier, has a learning function, or hypothesis, which is consistent with the training set.
3. Sampling. An unlabeled instance is drawn at random from the set of unlabeled instances and passed to the committee. The committee also samples two or more classifiers, or hypotheses, from the set above using a defined probability distribution; these hypotheses are said to be competing. Each classifier produces a label, or a value, for the unlabeled instance, generated in accordance with all previously labeled instances. The labels produced by the classifiers can be either all equal or contending, that is, at least two classifiers predict a different label for the same instance. When the labels are contending, the committee is said to disagree on the instance.
4. Disagreement measurement. The level of disagreement among the competing hypotheses is then determined by a disagreement measure. The measure can either compute the exclusive disjunction (XOR) of the committee classifications, or it can view the vote distribution as probabilities and use an uncertainty measure such as vote entropy [DE95] or Kullback-Leibler (KL) divergence [KL51]. In [SOS92] the authors emphasized the equivalence between the mutual information of an instance and the relative entropy: since the mutual information cannot be computed directly, the learner can infer the informativeness of the instance by measuring the relative entropy.
5. Decision. On the basis of the value obtained in the previous operation, the algorithm decides whether to query the oracle or to discard the instance.
6. Oracle. When the oracle is queried with an instance, it returns, or unveils, the true label of the instance.
7. Retraining. The labeled instance obtained from querying the oracle is added to the training set and the committee is then retrained using the new training set.
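The seven operations above can be condensed into a short sketch. The committee of three classifiers, the vote-entropy disagreement measure, the query threshold, and the toy family of 1-D threshold classifiers are all illustrative assumptions for the example, not part of the original formulation:

```python
import math
import random

def vote_entropy(votes):
    """Vote entropy of the committee's label distribution (one disagreement measure)."""
    n = len(votes)
    counts = {}
    for v in votes:
        counts[v] = counts.get(v, 0) + 1
    return -sum(c / n * math.log(c / n) for c in counts.values())

def qbc(train, pool, oracle, fit, committee_size=3, threshold=0.5):
    """Generalized QBC sketch: train a committee, then sample, measure
    disagreement, decide, query the oracle, and retrain."""
    committee = [fit(train, seed=k) for k in range(committee_size)]  # 1-2. set up + training
    for x in pool:                                                   # 3. sampling
        votes = [h(x) for h in committee]
        if vote_entropy(votes) >= threshold:                         # 4-5. disagreement + decision
            train.append((x, oracle(x)))                             # 6. oracle
            committee = [fit(train, seed=k) for k in range(committee_size)]  # 7. retraining
    return committee

# Toy usage: 1-D threshold classifiers, target concept "x > 0.5" (all assumed).
def fit(train, seed):
    lo = max([x for x, y in train if y == 0], default=0.0)
    hi = min([x for x, y in train if y == 1], default=1.0)
    t = random.Random(seed + len(train)).uniform(lo, hi)  # any consistent threshold
    return lambda x: int(x > t)

random.seed(0)
pool = [random.random() for _ in range(50)]
committee = qbc([(0.0, 0), (1.0, 1)], pool, lambda x: int(x > 0.5), fit)
```

In this toy run the interval [lo, hi] of thresholds consistent with the labels shrinks with every query, so the committee converges while paying the oracle only for instances on which it disagrees.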
There are two fundamental properties of the generalized QBC algorithm:
- During the (re)training and sampling phases, QBC maintains a version space, that is, the set of learning functions of the committee's classifiers which are consistent with all previously labeled instances.
- The algorithm is proved [FSST97] to converge to a generalization, or misclassification, error ε > 0 with a number of queries to the oracle that grows only as O(log 1/ε).
The generalized QBC algorithm was never implemented as described, because the Sampling operation requires the ability to sample from a large version space and to maintain it, that is, to retrain all classifiers; neither task can be carried out in a reasonable amount of time, as shown in [BFS02]. Research to overcome this limitation has produced two approaches, both of which focus on redefining the Sampling operation:
- The first approach, initially proposed in [Das04] as a generic active learning sampling strategy, uses a greedy search over the whole set of unlabeled instances: it evaluates a label for every instance in the unlabeled set and then searches for the instance that minimizes the generalization error. In this setting, finding the next instance to query requires evaluating the entire set of unlabeled instances. This approach is proved [Das04] to be close to optimal and requires only one classifier. Its shortcoming is that for large data sets it is very computationally intensive.
- The second approach, used in KQBC [GbNT05], replaces the sampling operation entirely with specific algorithms that return a probability Pr[·] for an instance x for each class. The main idea behind this approach is that instances which have similar probabilities across the classes are those which the classifier cannot distinguish, and therefore they are worth querying.
For example, in a binary classification task y ∈ {0, 1}, KQBC [GbNT05] samples from the set of unlabeled instances the instance with the smallest difference between Pr[y = 0] and Pr[y = 1].
KQBC determines the probability Pr[·] using a stochastic approach that queries for the label y of a given instance x with probability 2 Pr[w · x > 0] Pr[w · x < 0], where w is a separating hyperplane drawn from the version space
V = {w : ||w|| ≤ 1 and ∀i yi(w · xi) > 0}.
Therefore the main difference from the generalized QBC algorithm is that in KQBC there is no need to sample two random hypotheses from the version space. It is worth noting that the KQBC sampling approach is restricted to linear classifiers and is similar to Bayes point machines [HGC01].
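The two redefinitions of the Sampling operation can be contrasted in a short sketch. The hypothesis class, the probability estimate, and the tiny pool are invented for illustration; in particular, the greedy criterion below (pick the instance that most evenly splits the hypotheses) is only a simple proxy for the error-minimization search of [Das04]:

```python
def greedy_query(pool, hypotheses):
    """[Das04]-style greedy search: scan the entire unlabeled pool and pick
    the instance on which the surviving hypotheses are most evenly split."""
    def imbalance(x):
        positives = sum(h(x) for h in hypotheses)
        return abs(positives / len(hypotheses) - 0.5)
    return min(pool, key=imbalance)

def kqbc_style_query(pool, prob_y1):
    """KQBC-style selection: pick the instance whose class probabilities
    Pr[y = 0] and Pr[y = 1] are closest, i.e. the hardest to distinguish."""
    return min(pool, key=lambda x: abs((1 - prob_y1(x)) - prob_y1(x)))

# Toy usage: a version space of 1-D threshold classifiers (assumed example).
hypotheses = [lambda x, t=t: int(x > t) for t in (0.2, 0.4, 0.6, 0.8)]
pool = [0.1, 0.5, 0.9]
print(greedy_query(pool, hypotheses))       # → 0.5 (splits the hypotheses 2-2)
print(kqbc_style_query(pool, lambda x: x))  # → 0.5 (Pr[y=1] closest to 1/2)
```

KQBC itself makes the decision stochastically, querying with probability 2 Pr[w · x > 0] Pr[w · x < 0]; the deterministic minimum-gap rule above is used only to keep the sketch short.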
Batch active learning
In the previous section we showed that the shortcoming of a greedy search strategy such as [Das04] is that it is necessary to evaluate the entire set of unlabeled instances, which can be a computationally intensive and time consuming task. This problem is not limited to QBC but is general in active learning, especially since data sets keep expanding both in the number of unlabeled instances and in the number of features per instance. A common approach by which batch active learning [HJZL06, RT10, SR08] tackles this problem is to sample a batch of instances at a time, meaning that N (the batch size) instances are sampled at once and their labels are requested from the oracle simultaneously. In [RT10] the authors proved three important properties of batch active learning when compared to sequential active learning:
- A linear speed-up until N equals the dimension of the version space, for some families of target functions.
- A logarithmic speed-up in all cases.
- At most a logarithmic speed-up for large values of N.
A simple strategy for batch active learning is to select the top N most informative instances; however, this strategy presents an issue [Set09]:
Myopically querying the “Q-best” queries according to some instance-level query strategy often does not work well, since it fails to consider the overlap in information content among the “best” instances.
The challenge in batch active learning is how to devise a strategy that assembles the optimal query set Q.
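The overlap issue can be made concrete with a small sketch. The informativeness and distance functions, the pool, and the diversity weight beta are all assumed for illustration, and the greedy diversity-aware rule is just one simple alternative, not the method of [RT10] or [HJZL06]:

```python
def top_n_batch(pool, informativeness, n):
    """Naive batch strategy: the N most informative instances,
    ignoring the overlap in information content among them."""
    return sorted(pool, key=informativeness, reverse=True)[:n]

def diverse_batch(pool, informativeness, distance, n, beta=10.0):
    """Greedy diversity-aware sketch: repeatedly add the instance that is
    informative and far from those already selected; beta (assumed weight)
    trades informativeness against diversity."""
    batch, candidates = [], list(pool)
    while candidates and len(batch) < n:
        best = max(candidates,
                   key=lambda x: informativeness(x)
                   + beta * min((distance(x, b) for b in batch), default=0.0))
        batch.append(best)
        candidates.remove(best)
    return batch

# Toy usage: 1-D pool, informativeness peaks at 0.5 (assumed example).
pool = [0.49, 0.5, 0.51, 0.3, 0.7]
inf = lambda x: 1 - 2 * abs(x - 0.5)
dist = lambda a, b: abs(a - b)
print(top_n_batch(pool, inf, 3))          # → [0.5, 0.49, 0.51]: near-duplicates
print(diverse_batch(pool, inf, dist, 3))  # → [0.5, 0.3, 0.7]: spread out
```

The naive strategy queries three nearly identical instances, so the oracle's three answers carry almost the same information; penalizing proximity to already-selected instances spreads the batch across the pool.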
References
[Ang88] Dana Angluin. Queries and concept learning. Machine Learning, 2(4):319–342, 1988.
[BFS02] Ran Bachrach, Shai Fine, and Eli Shamir. Query by committee, linear separation and random walks. Theoretical Computer Science, 284, 2002.
[Das04] Sanjoy Dasgupta. Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems, pages 337–344. MIT Press, 2004.
[DE95] Ido Dagan and Sean P. Engelson. Committee-based sampling for training probabilistic classifiers. In Proceedings of the Twelfth International Conference on Machine Learning, pages 150–157. Morgan Kaufmann, 1995.
[FSST97] Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28:133–168, 1997. doi:10.1023/A:1007330508534.
[GbNT05] Ran Gilad-Bachrach, Amir Navot, and Naftali Tishby. Query by committee made real. In Advances in Neural Information Processing Systems 18. MIT Press, 2005.
[HGC01] Ralf Herbrich, Thore Graepel, and Colin Campbell. Bayes point machines. Journal of Machine Learning Research, 1:245–279, 2001.
[HJZL06] Steven C. H. Hoi, Rong Jin, Jianke Zhu, and Michael R. Lyu. Batch mode active learning and its application to medical image classification. In Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006.
[KL51] S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, 1951.
[Kol56] A. Kolmogorov. On the Shannon theory of information transmission in the case of continuous signals. IRE Transactions on Information Theory, 2(4):102–108, December 1956.
[Mit82] Tom M. Mitchell. Generalization as search. Artificial Intelligence, 18(2):203–226, 1982.
[RT10] Philippe Rolet and Olivier Teytaud. Complexity bounds for batch active learning in classification. In José Balcázar, Francesco Bonchi, Aristides Gionis, and Michèle Sebag, editors, Machine Learning and Knowledge Discovery in Databases, volume 6323 of Lecture Notes in Computer Science, pages 293–305. Springer Berlin / Heidelberg, 2010. doi:10.1007/978-3-642-15939-8_19.
[SOS92] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT ’92, pages 287–294, New York, NY, USA, 1992. ACM.
[Set09] Burr Settles. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.