Trees | Indices | Help |
|
---|
|
Utilities for data manipulation **FILE FORMATS:** - *.qdat files* contain quantized data suitable for feeding to learning algorithms. The .qdat file, written by _DecTreeGui_, is structured as follows: 1) Any number of lines which are ignored. 2) A line containing the string 'Variable Table' any number of variable definitions in the format: '# Variable_name [quant_bounds]' where '[quant_bounds]' is a list of the boundaries used for quantizing that variable. If the variable is inherently integral (i.e. not quantized), this can be an empty list. 3) A line beginning with '# ----' which signals the end of the variable list 4) Any number of lines containing data points, in the format: 'Name_of_point var1 var2 var3 .... varN' all variable values should be integers Throughout, it is assumed that varN is the result - *.dat files* contain the same information as .qdat files, but the variable values can be anything (floats, ints, strings). **These files should still contain quant_bounds!** - *.qdat.pkl file* contain a pickled (binary) representation of the data read in. They stores, in order: 1) A python list of the variable names 2) A python list of lists with the quantization bounds 3) A python list of the point names 4) A python list of lists with the data points
|
|||
|
|||
|
|||
|
|||
|
|||
|
|||
|
|||
|
|||
|
|||
|
|||
|
|||
|
|||
|
|||
|
|||
|
|||
|
|||
|
|||
|
|||
|
Imports: re, csv, random, six, cPickle, xrange, map, RDConfig, fileutils, MLData, DbConnect, BitUtils
|
writes out a .qdat file **Arguments** - outFile: a file object - varNames: a list of variable names - qBounds: the list of quantization bounds (should be the same length as _varNames_) - examples: the data to be written |
reads the variables and quantization bounds from a .qdat or .dat file **Arguments** - inFile: a file object **Returns** a 2-tuple containing: 1) varNames: a list of the variable names 2) qbounds: the list of quantization bounds for each variable |
reads the examples from a .qdat file **Arguments** - inFile: a file object **Returns** a 2-tuple containing: 1) the names of the examples 2) a list of lists containing the examples themselves **Note** because this is reading a .qdat file, it assumed that all variable values are integers |
reads the examples from a .dat file **Arguments** - inFile: a file object **Returns** a 2-tuple containing: 1) the names of the examples 2) a list of lists containing the examples themselves **Note** - this attempts to convert variable values to ints, then floats. if those both fail, they are left as strings |
builds a data set from a .qdat file **Arguments** - fileName: the name of the .qdat file **Returns** an _MLData.MLQuantDataSet_ |
builds a data set from a .dat file **Arguments** - fileName: the name of the .dat file **Returns** an _MLData.MLDataSet_ |
calculates the number of possible values for each variable in a data set **Arguments** - data: a list of examples - order: the ordering map between the variables in _data_ and _qBounds_ - qBounds: the quantization bounds for the variables **Returns** a list with the number of possible values each variable takes on in the data set **Notes** - variables present in _qBounds_ will have their _nPossible_ number read from _qbounds - _nPossible_ for other numeric variables will be calculated |
writes either a .qdat.pkl or a .dat.pkl file **Arguments** - outName: the name of the file to be used - data: either an _MLData.MLDataSet_ or an _MLData.MLQuantDataSet_ |
>>> v = [10,20,30,40,50] >>> TakeEnsemble(v,(1,2,3)) [20, 30, 40] >>> v = ['foo',10,20,30,40,50,1] >>> TakeEnsemble(v,(1,2,3),isDataVect=True) ['foo', 20, 30, 40, 1] |
constructs an _MLData.MLDataSet_ from a database **Arguments** - dbName: the name of the database to be opened - tableName: the table name containing the data in the database - user: the user name to be used to connect to the database - password: the password to be used to connect to the database - dupCol: if nonzero specifies which column should be used to recognize duplicates. **Returns** an _MLData.MLDataSet_ **Notes** - this uses Dbase.DataUtils functionality |
constructs an _MLData.MLDataSet_ from a bunch of text #DOC **Arguments** - reader needs to be iterable and return lists of elements (like a csv.reader) **Returns** an _MLData.MLDataSet_ |
Seeds the random number generators **Arguments** - seed: a 2-tuple containing integers to be used as the random number seeds **Notes** this seeds both the RDRandom generator and the one in the standard Python _random_ module |
#DOC |
#DOC |
randomizes the activity values of a dataset **Arguments** - dataSet: a _ML.Data.MLQuantDataSet_, the activities here will be randomized - shuffle: an optional toggle. If this is set, the activity values will be shuffled (so the number in each class remains constant) - runDetails: an optional CompositeRun object **Note** - _examples_ are randomized in place |
Trees | Indices | Help |
|
---|
Generated by Epydoc 3.0.1 on Thu Feb 1 16:13:01 2018 | http://epydoc.sourceforge.net |