
Source Code for Module rdkit.ML.SLT.Risk

#
#  Copyright (C) 2000-2008  greg Landrum
#
""" code for calculating empirical risk

"""
import math


def log2(x):
  # log base 2 (math.log2 exists in Python >= 3.3; this form is kept for compatibility)
  return math.log(x) / math.log(2.)

def BurgesRiskBound(VCDim, nData, nWrong, conf):
  """ Calculates Burges's formulation of the risk bound

    The formulation is from Eqn. 3 of Burges's review
    article "A Tutorial on Support Vector Machines for Pattern Recognition",
    _Data Mining and Knowledge Discovery_, Kluwer Academic Publishers,
    (1998) Vol. 2

    **Arguments**

      - VCDim: the VC dimension of the system

      - nData: the number of data points used

      - nWrong: the number of data points misclassified

      - conf: the confidence to be used for this risk bound

    **Returns**

      - a float

    **Notes**

      - This has been validated against the Burges paper.

      - I believe that this is only technically valid for binary classification.

  """
  # maintain consistency of notation with Burges's paper
  h = VCDim
  l = nData
  eta = conf

  numerator = h * (math.log(2. * l / h) + 1.) - math.log(eta / 4.)
  structRisk = math.sqrt(numerator / l)

  rEmp = float(nWrong) / l

  return rEmp + structRisk
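As a quick standalone check (not part of the rdkit source), the bound can be evaluated with made-up inputs; the function body is transcribed into the snippet so it runs without rdkit installed:

```python
import math


# Transcription of BurgesRiskBound from the listing above, repeated here so
# the snippet is self-contained.
def BurgesRiskBound(VCDim, nData, nWrong, conf):
  h, l, eta = VCDim, nData, conf  # Burges's notation
  numerator = h * (math.log(2. * l / h) + 1.) - math.log(eta / 4.)
  structRisk = math.sqrt(numerator / l)  # VC confidence term
  rEmp = float(nWrong) / l  # empirical risk
  return rEmp + structRisk


# Hypothetical inputs: VC dimension 10, 1000 data points, 50 misclassified,
# confidence parameter eta = 0.05.
bound = BurgesRiskBound(10, 1000, 50, 0.05)
print(bound)  # empirical risk (0.05) plus the structural risk term
```

The bound always exceeds the empirical risk, and it loosens as the VC dimension grows relative to the number of data points.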

def CristianiRiskBound(VCDim, nData, nWrong, conf):
  """ Calculates the Cristianini/Shawe-Taylor risk bound

    The formulation here is from pg 58, Theorem 4.6 of the book
    "An Introduction to Support Vector Machines" by Cristianini and Shawe-Taylor,
    Cambridge University Press, 2000

    **Arguments**

      - VCDim: the VC dimension of the system

      - nData: the number of data points used

      - nWrong: the number of data points misclassified

      - conf: the confidence to be used for this risk bound

    **Returns**

      - a float

    **Notes**

      - this generates odd values that do not match the other bounds here

  """
  # maintain consistency of notation with Cristianini's book
  d = VCDim
  delta = conf
  l = nData
  k = nWrong

  structRisk = math.sqrt((4. / l) * (d * log2((2. * math.e * l) / d) + log2(4. / delta)))
  rEmp = 2. * k / l
  return rEmp + structRisk
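To illustrate the mismatch noted above, the same made-up inputs used for the Burges bound give a much looser value here; the function is transcribed so the snippet runs standalone:

```python
import math


def log2(x):
  return math.log(x) / math.log(2.)


# Transcription of CristianiRiskBound from the listing above, repeated so the
# snippet is self-contained.
def CristianiRiskBound(VCDim, nData, nWrong, conf):
  d, delta, l, k = VCDim, conf, nData, nWrong
  structRisk = math.sqrt((4. / l) * (d * log2((2. * math.e * l) / d) + log2(4. / delta)))
  rEmp = 2. * k / l  # note the factor of 2 on the empirical risk
  return rEmp + structRisk


# Same hypothetical inputs as for the Burges bound: d=10, l=1000, k=50, delta=0.05.
bound = CristianiRiskBound(10, 1000, 50, 0.05)
print(bound)
```

Both the doubled empirical-risk term and the base-2 logarithms in the structural term make this bound substantially larger than Burges's for the same inputs.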

def CherkasskyRiskBound(VCDim, nData, nWrong, conf, a1=1.0, a2=2.0):
  """ Calculates the Cherkassky-Mulier formulation of the risk bound

    The formulation here is from Eqns 4.22 and 4.23 on pg 108 of
    Cherkassky and Mulier's book "Learning From Data", Wiley, 1998.

    **Arguments**

      - VCDim: the VC dimension of the system

      - nData: the number of data points used

      - nWrong: the number of data points misclassified

      - conf: the confidence to be used for this risk bound

      - a1, a2: constants in the risk equation. Restrictions on these values:

        - 0 <= a1 <= 4

        - 0 <= a2 <= 2

    **Returns**

      - a float

    **Notes**

      - This appears to behave reasonably.

      - The default a1=1.0 is chosen by analogy to Burges's paper.

  """
  # maintain consistency of notation with Cherkassky's book
  h = VCDim
  n = nData
  eta = conf
  rEmp = float(nWrong) / n

  numerator = h * (math.log(float(a2 * n) / h) + 1) - math.log(eta / 4.)
  eps = a1 * numerator / n

  structRisk = eps / 2. * (1. + math.sqrt(1. + (4. * rEmp / eps)))

  return rEmp + structRisk
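With the same made-up inputs, this formulation lands between the empirical risk and the Cristianini/Shawe-Taylor value, and tightens or loosens with a1; the function is transcribed so the snippet runs standalone:

```python
import math


# Transcription of CherkasskyRiskBound from the listing above, repeated so the
# snippet is self-contained.
def CherkasskyRiskBound(VCDim, nData, nWrong, conf, a1=1.0, a2=2.0):
  h, n, eta = VCDim, nData, conf  # Cherkassky's notation
  rEmp = float(nWrong) / n
  numerator = h * (math.log(float(a2 * n) / h) + 1) - math.log(eta / 4.)
  eps = a1 * numerator / n
  structRisk = eps / 2. * (1. + math.sqrt(1. + (4. * rEmp / eps)))
  return rEmp + structRisk


# Same hypothetical inputs: h=10, n=1000, 50 misclassified, eta=0.05.
bound = CherkasskyRiskBound(10, 1000, 50, 0.05)
print(bound)
```

Raising a1 toward its maximum of 4 scales eps up and therefore loosens the bound.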