Trees | Indices | Help |
|
---|
|
|
|||
|
|||
|
|||
|
|||
|
|
|||
SeqTypes = list, tuple
|
Imports: random, os, sys, RDConfig, RDRandom, xrange
|
splits a set of indices into a data set into 2 pieces **Arguments** - nPts: the total number of points - frac: the fraction of the data to be put in the first data set - silent: (optional) toggles display of stats - legacy: (optional) use the legacy splitting approach - replacement: (optional) use selection with replacement **Returns** a 2-tuple containing the two sets of indices. **Notes** - the _legacy_ splitting approach uses randomly-generated floats and compares them to _frac_. This is provided for backwards-compatibility reasons. - the default splitting approach uses a random permutation of indices which is split into two parts. - selection with replacement can generate duplicates. **Usage**: We'll start with a set of indices and pick from them using the three different approaches: >>> from rdkit.ML.Data import DataUtils The base approach always returns the same number of compounds in each set and has no duplicates: >>> DataUtils.InitRandomNumbers((23,42)) >>> test,train = SplitIndices(10,.5) >>> test [1, 5, 6, 4, 2] >>> train [3, 0, 7, 8, 9] >>> test,train = SplitIndices(10,.5) >>> test [5, 2, 9, 8, 7] >>> train [6, 0, 3, 1, 4] The legacy approach can return varying numbers, but still has no duplicates. Note the indices come back ordered: >>> DataUtils.InitRandomNumbers((23,42)) >>> test,train = SplitIndices(10,.5,legacy=1) >>> test [3, 5, 7, 8, 9] >>> train [0, 1, 2, 4, 6] >>> test,train = SplitIndices(10,.5,legacy=1) >>> test [0, 1, 2, 3, 5, 8, 9] >>> train [4, 6, 7] The replacement approach returns a fixed number in the training set, a variable number in the test set and can contain duplicates in the training set. >>> DataUtils.InitRandomNumbers((23,42)) >>> test,train = SplitIndices(10,.5,replacement=1) >>> test [9, 9, 8, 0, 5] >>> train [1, 2, 3, 4, 6, 7] >>> test,train = SplitIndices(10,.5,replacement=1) >>> test [4, 5, 1, 1, 4] >>> train [0, 2, 3, 6, 7, 8, 9] |
splits a data set into two pieces **Arguments** - data: a list of examples to be split - frac: the fraction of the data to be put in the first data set - silent: controls the amount of visual noise produced. **Returns** a 2-tuple containing the two new data sets. |
"splits" a data set held in a DB by returning lists of ids **Arguments**: - conn: a DbConnect object - frac: the split fraction. This can optionally be specified as a sequence with a different fraction for each activity value. - table,fields,where,join: (optional) SQL query parameters - useActs: (optional) toggles splitting based on activities (ensuring that a given fraction of each activity class ends up in the hold-out set) Defaults to 0 - nActs: (optional) number of possible activity values, only used if _useActs_ is nonzero Defaults to 2 - actCol: (optional) name of the activity column Defaults to use the last column returned by the query - actBounds: (optional) sequence of activity bounds (for cases where the activity isn't quantized in the db) Defaults to an empty sequence - silent: controls the amount of visual noise produced. **Usage**: Set up the db connection, the simple tables we're using have actives with even ids and inactives with odd ids: >>> from rdkit.ML.Data import DataUtils >>> from rdkit.Dbase.DbConnection import DbConnect >>> conn = DbConnect(RDConfig.RDTestDatabase) Pull a set of points from a simple table... take 33% of all points: >>> DataUtils.InitRandomNumbers((23,42)) >>> train,test = SplitDbData(conn,1./3.,'basic_2class') >>> [str(x) for x in train] ['id-7', 'id-6', 'id-2', 'id-8'] ...take 50% of actives and 50% of inactives: >>> DataUtils.InitRandomNumbers((23,42)) >>> train,test = SplitDbData(conn,.5,'basic_2class',useActs=1) >>> [str(x) for x in train] ['id-5', 'id-3', 'id-1', 'id-4', 'id-10', 'id-8'] Notice how the results came out sorted by activity We can be asymmetrical: take 33% of actives and 50% of inactives: >>> DataUtils.InitRandomNumbers((23,42)) >>> train,test = SplitDbData(conn,[.5,1./3.],'basic_2class',useActs=1) >>> [str(x) for x in train] ['id-5', 'id-3', 'id-1', 'id-4', 'id-10'] And we can pull from tables with non-quantized activities by providing activity quantization bounds: >>> DataUtils.InitRandomNumbers((23,42)) >>> train,test = SplitDbData(conn,.5,'float_2class',useActs=1,actBounds=[1.0]) >>> [str(x) for x in train] ['id-5', 'id-3', 'id-1', 'id-4', 'id-10', 'id-8'] |
Trees | Indices | Help |
|
---|
Generated by Epydoc 3.0.1 on Thu Feb 1 16:13:01 2018 | http://epydoc.sourceforge.net |