Module PandasTools
source code
Importing pandasTools enables several features that allow for using RDKit molecules as columns of a Pandas dataframe.
If the dataframe is containing a molecule format in a column (e.g. smiles), like in this example:
>>> from rdkit.Chem import PandasTools
>>> import pandas as pd
>>> import os
>>> from rdkit import RDConfig
>>> antibiotics = pd.DataFrame(columns=['Name','Smiles'])
>>> antibiotics = antibiotics.append({'Smiles':'CC1(C(N2C(S1)C(C2=O)NC(=O)CC3=CC=CC=C3)C(=O)O)C','Name':'Penicilline G'}, ignore_index=True)#Penicilline G
>>> antibiotics = antibiotics.append({'Smiles':'CC1(C2CC3C(C(=O)C(=C(C3(C(=O)C2=C(C4=C1C=CC=C4O)O)O)O)C(=O)N)N(C)C)O','Name':'Tetracycline'}, ignore_index=True)#Tetracycline
>>> antibiotics = antibiotics.append({'Smiles':'CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O)O)C','Name':'Ampicilline'}, ignore_index=True)#Ampicilline
>>> print([str(x) for x in antibiotics.columns])
['Name', 'Smiles']
>>> print(antibiotics)
Name Smiles
0 Penicilline G CC1(C(N2C(S1)C(C2=O)NC(=O)CC3=CC=CC=C3)C(=O)O)C
1 Tetracycline CC1(C2CC3C(C(=O)C(=C(C3(C(=O)C2=C(C4=C1C=CC=C4...
2 Ampicilline CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O...
a new column can be created holding the respective RDKit molecule objects. The fingerprint can be included to accelerate substructure searches on the dataframe.
>>> PandasTools.AddMoleculeColumnToFrame(antibiotics,'Smiles','Molecule',includeFingerprints=True)
>>> print([str(x) for x in antibiotics.columns])
['Name', 'Smiles', 'Molecule']
A substructure filter can be applied on the dataframe using the RDKit molecule column, because the ">=" operator has been modified to work as a substructure check.
Such the antibiotics containing the beta-lactam ring "C1C(=O)NC1" can be obtained by
>>> beta_lactam = Chem.MolFromSmiles('C1C(=O)NC1')
>>> beta_lactam_antibiotics = antibiotics[antibiotics['Molecule'] >= beta_lactam]
>>> print(beta_lactam_antibiotics[['Name','Smiles']])
Name Smiles
0 Penicilline G CC1(C(N2C(S1)C(C2=O)NC(=O)CC3=CC=CC=C3)C(=O)O)C
2 Ampicilline CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=CC=C3)N)C(=O...
It is also possible to load an SDF file can be load into a dataframe.
>>> sdfFile = os.path.join(RDConfig.RDDataDir,'NCI/first_200.props.sdf')
>>> frame = PandasTools.LoadSDF(sdfFile,smilesName='SMILES',molColName='Molecule',includeFingerprints=True)
>>> frame.info # doctest: +SKIP
<bound method DataFrame.info of <class 'pandas.core.frame.DataFrame'>
Int64Index: 200 entries, 0 to 199
Data columns:
AMW 200 non-null values
CLOGP 200 non-null values
CP 200 non-null values
CR 200 non-null values
DAYLIGHT.FPG 200 non-null values
DAYLIGHT_CLOGP 200 non-null values
FP 200 non-null values
ID 200 non-null values
ISM 200 non-null values
LIPINSKI_VIOLATIONS 200 non-null values
NUM_HACCEPTORS 200 non-null values
NUM_HDONORS 200 non-null values
NUM_HETEROATOMS 200 non-null values
NUM_LIPINSKIHACCEPTORS 200 non-null values
NUM_LIPINSKIHDONORS 200 non-null values
NUM_RINGS 200 non-null values
NUM_ROTATABLEBONDS 200 non-null values
P1 30 non-null values
SMILES 200 non-null values
Molecule 200 non-null values
dtypes: object(20)>
Conversion to html is quite easy:
>>> htm = frame.to_html()
>>> str(htm[:36])
'<table border="1" class="dataframe">'
In order to support rendering the molecules as images in the HTML export of the dataframe, the __str__ method is monkey-patched to return a base64 encoded PNG:
>>> molX = Chem.MolFromSmiles('Fc1cNc2ccccc12')
>>> print(molX) # doctest: +SKIP
<img src="data:image/png;base64,..." alt="Mol"/>
This can be reverted using the ChangeMoleculeRendering method
>>> ChangeMoleculeRendering(renderer='String')
>>> print(molX) # doctest: +SKIP
<rdkit.Chem.rdchem.Mol object at 0x10d179440>
>>> ChangeMoleculeRendering(renderer='PNG')
>>> print(molX) # doctest: +SKIP
<img src="data:image/png;base64,..." alt="Mol"/>
|
patchPandasHTMLrepr(self,
**kwargs)
Patched default escaping of HTML control characters to allow molecule image rendering dataframes |
source code
|
|
|
|
|
_get_image(x)
displayhook function for PIL Images, rendered as PNG |
source code
|
|
|
_get_svg_image(mol,
size=(200,200),
highlightAtoms=[])
mol rendered as SVG |
source code
|
|
|
_molge(x,
y)
Allows for substructure check using the >= operator (X has substructure Y -> X >= Y) by... |
source code
|
|
|
|
|
|
|
|
|
|
|
AddMoleculeColumnToFrame(frame,
smilesCol='Smiles',
molCol='ROMol',
includeFingerprints=False)
Converts the molecules contains in "smilesCol" to RDKit molecules and appends them to the dataframe "frame" using the specified column name. |
source code
|
|
|
|
|
LoadSDF(filename,
idName='ID',
molColName='ROMol',
includeFingerprints=False,
isomericSmiles=False,
smilesName=None,
embedProps=False)
Read file in SDF format and return as Pandas data frame. |
source code
|
|
|
WriteSDF(df,
out,
molColName='ROMol',
idName=None,
properties=None,
allNumeric=False)
Write an SD file for the molecules in the dataframe. |
source code
|
|
|
RemoveSaltsFromFrame(frame,
molCol='ROMol')
Removes salts from mols in pandas DataFrame's ROMol column |
source code
|
|
|
|
|
|
|
FrameToGridImage(frame,
column='ROMol',
legendsCol=None,
**kwargs)
Draw grid image of mols in pandas DataFrame. |
source code
|
|
|
AddMurckoToFrame(frame,
molCol='ROMol',
MurckoCol='Murcko_SMILES',
Generic=False)
Adds column with SMILES of Murcko scaffolds to pandas DataFrame. |
source code
|
|
|
AlignMol(mol,
scaffold)
Aligns mol (RDKit mol object) to scaffold (SMILES string) |
source code
|
|
|
AlignToScaffold(frame,
molCol='ROMol',
scaffoldCol='Murcko_SMILES')
Aligns molecules in molCol to scaffolds in scaffoldCol |
source code
|
|
|
v = pd.version.version.split('.')
|
|
defPandasRendering = pd.core.frame.DataFrame.to_html
|
|
pd = None
hash(x)
|
|
highlightSubstructures = True
|
|
molRepresentation = 'png'
|
|
molSize = 200, 200
|
|
_fingerprinter = lambda x, y:
|
|
_saltRemover = None
hash(x)
|
Imports:
b64encode,
types,
copy,
BytesIO,
string_types,
Chem,
Draw,
sys,
traceback,
fmt,
DataStructs,
pyAvalonTools,
SDWriter,
np,
os,
StringIO,
MurckoScaffold,
AllChem
Ensure inheritance of patched to_html in "head" subframe
|
Allows for substructure check using the >= operator (X has substructure Y -> X >= Y) by
monkey-patching the __ge__ function
This has the effect that the pandas/numpy rowfilter can be used for substructure filtering (filtered = dframe[dframe['RDKitColumn'] >= SubstructureMolecule])
|
returns the molecules as base64 encoded PNG image
|
Precomputes fingerprints and stores results in molecule objects to accelerate substructure matching
|
Changes the default dataframe rendering to not escape HTML characters, thus allowing rendered images in all dataframes.
IMPORTANT: THIS IS A GLOBAL CHANGE THAT WILL AFFECT TO COMPLETE PYTHON SESSION. If you want to change the rendering only
for a single dataframe use the "ChangeMoleculeRendering" method instead.
|
AddMoleculeColumnToFrame(frame,
smilesCol='Smiles',
molCol='ROMol',
includeFingerprints=False)
| source code
|
Converts the molecules contains in "smilesCol" to RDKit molecules and appends them to the dataframe "frame" using the specified column name.
If desired, a fingerprint can be computed and stored with the molecule objects to accelerate substructure matching
|
ChangeMoleculeRendering(frame=None,
renderer='PNG')
| source code
|
Allows to change the rendering of the molecules between base64 PNG images and string representations.
This serves two purposes: First it allows to avoid the generation of images if this is not desired and, secondly, it allows to enable image rendering for
newly created dataframe that already contains molecules, without having to rerun the time-consuming AddMoleculeColumnToFrame. Note: this behaviour is, because some pandas methods, e.g. head()
returns a new dataframe instance that uses the default pandas rendering (thus not drawing images for molecules) instead of the monkey-patched one.
|
LoadSDF(filename,
idName='ID',
molColName='ROMol',
includeFingerprints=False,
isomericSmiles=False,
smilesName=None,
embedProps=False)
| source code
|
Read file in SDF format and return as Pandas data frame. If embedProps=True all properties also get embedded in Mol objects in the molecule column.
|
WriteSDF(df,
out,
molColName='ROMol',
idName=None,
properties=None,
allNumeric=False)
| source code
|
Write an SD file for the molecules in the dataframe. Dataframe columns can be exported as SDF tags if specified in the "properties" list. "properties=list(df.columns)" would export all columns.
The "allNumeric" flag allows to automatically include all numeric columns in the output. User has to make sure that correct data type is assigned to column.
"idName" can be used to select a column to serve as molecule title. It can be set to "RowID" to use the dataframe row key as title.
|
SaveSMILESFromFrame(frame,
outFile,
molCol='ROMol',
NamesCol='',
isomericSmiles=False)
| source code
|
Saves smi file. SMILES are generated from column with RDKit molecules. Column with names is optional.
|
SaveXlsxFromFrame(frame,
outFile,
molCol='ROMol',
size=(300,300))
| source code
|
Saves pandas DataFrame as a xlsx file with embedded images.
It maps numpy data types to excel cell types:
int, float -> number
datetime -> datetime
object -> string (limited to 32k character - xlsx limitations)
Cells with compound images are a bit larger than images due to excel.
Column width weirdness explained (from xlsxwriter docs):
The width corresponds to the column width value that is specified in Excel.
It is approximately equal to the length of a string in the default font of Calibri 11.
Unfortunately, there is no way to specify "AutoFit" for a column in the Excel file format.
This feature is only available at runtime from within Excel.
|
AddMurckoToFrame(frame,
molCol='ROMol',
MurckoCol='Murcko_SMILES',
Generic=False)
| source code
|
Adds column with SMILES of Murcko scaffolds to pandas DataFrame. Generic set to true results in SMILES of generic framework.
|