Posted on 2025-08-22, 08:23. Authored by Soham Savarkar, Jason Gibson, Balasubramanian Vasanthakumar, Brij M. Moudgil, Richard Hennig
<p dir="ltr">This CSV file contains a comprehensively curated dataset comprising physicochemical descriptors and biological assay data for engineered metal oxide nanoparticles. This dataset was specifically developed to support machine learning model training for toxicity prediction and represents the result of an extensive multi-stage data extraction and curation pipeline. Data sources included peer-reviewed publications and reputable open-access repositories such as the NanoPharos database. Details on the data sourcing process, prompt engineering strategies for large language model (LLM)-based extraction, and validation protocols are provided in the Supplementary Information section.</p><p dir="ltr">The dataset consists of 20 key features, which are grouped into four categories: physicochemical properties, biological responses, experimental exposure conditions, and fundamental core properties derived from periodic trends. These features were selected based on a comprehensive review of toxicological mechanisms associated with metal oxide nanoparticles and their interactions with biological systems.</p><p dir="ltr"><b>1. Physicochemical Descriptors:</b></p><p dir="ltr">These features represent the primary characteristics of each nanoparticle and play a critical role in determining toxicity. 
They include:</p><p dir="ltr">Hydrodynamic diameter (nm): Represents the nanoparticle size in suspension, accounting for solvation and agglomeration effects.</p><p dir="ltr">Zeta potential (mV): A measure of surface charge, influencing nanoparticle stability and interaction with biological membranes.</p><p dir="ltr">Surface area (m²/g): Affects the reactivity and potential for cellular interaction.</p><p dir="ltr">Aggregation state: Describes whether nanoparticles are dispersed, loosely clustered, or highly aggregated.</p><p dir="ltr">Dissolution rate (mg/L): Important for metal oxides that release toxic ions (e.g., Ag⁺, Zn²⁺).</p><p dir="ltr">Metal ion release (mg/L): Quantifies ionic dissolution contributing to oxidative stress.</p><p dir="ltr">Surface chemistry/coating: Encoded categorically; categories include PEGylated, citrate-coated, and uncoated.</p><p dir="ltr">Impurity content (%): Where reported, captures secondary contaminants from synthesis.</p><p dir="ltr"><b>2. Biological Response Variables:</b></p><p dir="ltr">These outcomes were derived from in vitro cytotoxicity assays and serve as toxicity labels or indicators:</p><p dir="ltr">Cell viability (%): A central endpoint indicating survival rate post-exposure.</p><p dir="ltr">Reactive Oxygen Species (ROS) generation: Expressed as fold-increase compared to control.</p><p dir="ltr">Lactate Dehydrogenase (LDH) leakage, apoptosis (%), necrosis (%): Markers of membrane damage (LDH leakage, necrosis) and programmed cell death (apoptosis).</p><p dir="ltr">IC50 (µg/mL): The concentration that inhibits 50% of cells, used as a toxicity threshold.</p><p dir="ltr">These biological metrics were used to define a binary toxicity label: entries were classified as toxic (1) or non-toxic (0) based on thresholds from standardized guidelines (e.g., ISO 10993-5:2009) and literature consensus. Criteria included IC50 ≤ 100 µg/mL, cell viability ≤ 70%, ROS > 2× control, and zeta potential outside the range of -30 to +30 mV. 
Entries with high apoptosis/necrosis rates (≥20% increase over control) were also flagged as toxic.</p><p dir="ltr"><b>3. Experimental Conditions:</b></p><p dir="ltr">To capture contextual variation in assay conditions, the dataset includes:</p><p dir="ltr">Exposure dose (µg/mL): Spanning 0–1000 µg/mL across studies.</p><p dir="ltr">Exposure time (hours): Captures both acute and extended exposure regimes (12 and 24 hrs).</p><p dir="ltr">Cell type/model: Consolidated into five broader categories (e.g., human cancer cells, normal human cells, murine cells, etc.).</p><p dir="ltr"><b>4. Core Material Properties:</b></p><p dir="ltr">Four elemental-level descriptors were added to enrich prediction with periodic trends:</p><p dir="ltr">Atomic weight, group number, period number, and electronegativity difference of elements in the nanoparticle core.</p><p dir="ltr">These provide mechanistic insights related to ion release potential, surface reactivity, and redox behavior.</p><p dir="ltr"><b>Data Cleaning and Normalization:</b></p><p dir="ltr">To ensure model reliability and generalizability, extensive preprocessing was undertaken:</p><p dir="ltr">Outlier management: Features with wide value ranges, such as hydrodynamic size or ROS scores, were log-transformed to reduce skewness.</p><p dir="ltr">Missing values: Numerical fields with missing entries were imputed using the median value of the respective column. Categorical inconsistencies were resolved using consistent label encoding.</p><p dir="ltr">Correlation reduction: Features with high interdependence (e.g., apoptosis vs. necrosis) were carefully screened using Pearson correlation analysis. 
The final dataset yielded an average feature Pearson correlation of 0.19, indicating low multicollinearity.</p><p dir="ltr">Encoding: Categorical variables such as surface coating and cell type were grouped into logical classes and label-encoded to enable model compatibility.</p><p dir="ltr"><b>Applications and Model Compatibility:</b></p><p dir="ltr">The dataset is optimized for use in supervised learning workflows and has been tested with algorithms such as:</p><p dir="ltr">Gradient Boosting Machines (GBM),</p><p dir="ltr">Support Vector Machines (SVM-RBF),</p><p dir="ltr">Random Forests, and</p><p dir="ltr">Principal Component Analysis (PCA) for feature reduction.</p><p dir="ltr">Training-validation experiments demonstrated accuracies up to 83% in toxicity classification, affirming the predictive potential of the curated descriptors. The dataset also enables parameter space mapping, allowing the generation of 2D/3D response surfaces showing toxicity trends across varying core sizes and dosages.</p><p dir="ltr">This curated dataset addresses several limitations of existing toxicological datasets by enhancing feature diversity, standardization, and data quality control. It is publicly available via the Supplementary Information section and aims to serve as a benchmark resource for researchers developing predictive nanotoxicology models.</p>
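<p dir="ltr">The labeling criteria and cleaning steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline. The dictionary keys (for example <code>ic50_ug_ml</code>) are hypothetical and need not match the CSV column names. The criteria are assumed to combine with a logical OR, so any single criterion marks an entry as toxic. Apoptosis and necrosis are assumed to be stored as the increase over control in percentage points, and a base-10 logarithm is assumed for the log transform.</p>

```python
import math
from statistics import median

# Thresholds quoted in the dataset description.
IC50_MAX_UG_ML = 100.0           # IC50 <= 100 µg/mL
VIABILITY_MAX_PCT = 70.0         # cell viability <= 70 %
ROS_FOLD_MIN = 2.0               # ROS > 2x control
ZETA_RANGE_MV = (-30.0, 30.0)    # zeta potential outside this range
DEATH_INCREASE_MIN_PCT = 20.0    # apoptosis/necrosis >= 20 % increase over control


def toxicity_label(row):
    """Return 1 (toxic) if any criterion is met, else 0.

    ASSUMPTION: criteria are OR-combined; keys are hypothetical names.
    Missing values (absent key or None) never trigger a criterion.
    """
    ic50 = row.get("ic50_ug_ml")
    viability = row.get("cell_viability_pct")
    ros = row.get("ros_fold")
    zeta = row.get("zeta_potential_mv")
    apoptosis = row.get("apoptosis_increase_pct")
    necrosis = row.get("necrosis_increase_pct")
    low, high = ZETA_RANGE_MV
    criteria = [
        ic50 is not None and ic50 <= IC50_MAX_UG_ML,
        viability is not None and viability <= VIABILITY_MAX_PCT,
        ros is not None and ros > ROS_FOLD_MIN,
        zeta is not None and not (low <= zeta <= high),
        apoptosis is not None and apoptosis >= DEATH_INCREASE_MIN_PCT,
        necrosis is not None and necrosis >= DEATH_INCREASE_MIN_PCT,
    ]
    return int(any(criteria))


def impute_median(values):
    """Replace None entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def log_transform(values):
    """Base-10 log transform (ASSUMPTION: all values are strictly positive)."""
    return [math.log10(v) for v in values]


def label_encode(categories):
    """Map each distinct category to an integer code, in sorted order."""
    codes = {name: i for i, name in enumerate(sorted(set(categories)))}
    return [codes[name] for name in categories], codes


# Made-up example entries, for illustration only (not taken from the dataset).
toxic_row = {"ic50_ug_ml": 45.0, "cell_viability_pct": 82.0,
             "ros_fold": 1.4, "zeta_potential_mv": -12.0}
benign_row = {"ic50_ug_ml": 250.0, "cell_viability_pct": 91.0,
              "ros_fold": 1.1, "zeta_potential_mv": 5.0,
              "apoptosis_increase_pct": 3.0}

labels = [toxicity_label(toxic_row), toxicity_label(benign_row)]
sizes_nm = impute_median([120.0, None, 80.0, 300.0])
log_sizes = log_transform(sizes_nm)
coating_codes, coating_map = label_encode(["citrate", "uncoated", "PEG", "citrate"])
```

<p dir="ltr">In this sketch the first entry is labeled toxic because its IC50 falls below 100 µg/mL, while the second meets none of the criteria. Imputing before the log transform is a choice made for the example; the description does not state the order in which the two steps were applied.</p>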