Applications for Machine Learning

Machine learning (ML) offers transformative potential in materials science and engineering by enabling data-driven discovery and optimization within complex materials systems. Through the application of ML algorithms, researchers can analyze vast datasets – often generated from simulations, experiments, or calculations – to identify patterns and relationships that might otherwise go unnoticed. The goal of integrating ML into materials research is to predict properties, optimize compositions, and accelerate the design of materials tailored for specific applications.

Thermo-Calc Software provides essential data on material properties and microstructural features that vary by composition and temperature, making it a valuable tool for developing accurate ML models. Below we outline the primary ways in which Thermo-Calc Software’s products can address some of the common challenges encountered when applying ML to materials science.

As in all cases, the use of Thermo-Calc Software’s products is governed by the terms of our End User License Agreement. Clarifications on usage of Thermo-Calc Software tools and data in training ML models are provided at the end of this webpage.

Data Quality / Data Cleaning

Identify outliers and validate assumptions

Data quality is of the utmost importance to the application of ML in materials science, where reliable predictions depend on clean, consistent data. Noisy or inconsistent data can distort results, leading to models that misrepresent material properties. Data cleaning involves verifying measurement consistency, identifying outliers, and checking that the data agree with physical assumptions (e.g., flagging liquid viscosity values that were reported for semi-solid regions). CALPHAD-based calculations support these checks and so improve the datasets used for training.


Example: Combined CALPHAD and Machine Learning for Property Modelling

In a thesis by Paulus (2020), Thermo-Calc calculations were made using the TCFE database for steels to calculate the thermodynamic barrier for the martensitic transformation, aiding in identifying stable and metastable regions within the alloy composition space. Firstly, a batch process was used to calculate Gibbs energy thresholds to remove outliers from an original dataset of 3,600 data points. Then an additional check for austenite stability across the composition range was employed to validate which alloys would be useful for ML predictions on martensite start temperatures.

Read the paper: Combined CALPHAD and Machine Learning for Property Modelling
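The two-stage screening described in this example (a threshold check followed by a stability check) can be sketched as below. This is only an illustration of the filtering pattern, not the procedure used in the thesis: the record fields, the threshold window, and all numerical values are hypothetical placeholders for quantities that would come from CALPHAD calculations.

```python
from dataclasses import dataclass


@dataclass
class AlloyRecord:
    alloy_id: str
    ms_temperature_k: float      # measured martensite start temperature
    driving_force_j_mol: float   # hypothetical: value computed by CALPHAD
    austenite_stable: bool       # hypothetical: result of a stability check


def screen_records(records, min_force, max_force):
    """Split records into kept and rejected lists.

    A record is kept only if its computed driving force lies inside the
    window [min_force, max_force] and the austenite stability check passed.
    """
    kept, rejected = [], []
    for record in records:
        in_window = min_force <= record.driving_force_j_mol <= max_force
        if in_window and record.austenite_stable:
            kept.append(record)
        else:
            rejected.append(record)
    return kept, rejected


# Made-up values, for illustration only.
records = [
    AlloyRecord("A", 600.0, 1500.0, True),
    AlloyRecord("B", 650.0, 9000.0, True),   # outside the assumed window
    AlloyRecord("C", 580.0, 1700.0, False),  # fails the stability check
    AlloyRecord("D", 620.0, 1300.0, True),
]

kept, rejected = screen_records(records, min_force=500.0, max_force=3000.0)
print("kept:", [r.alloy_id for r in kept])
print("rejected:", [r.alloy_id for r in rejected])
```

Keeping the rejected records, rather than discarding them, makes it possible to review why each point was excluded before the model is trained.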

Data Quantity

Generate extensive datasets

Beyond quality, data quantity is also critical for training robust ML models, as larger datasets enable models to detect patterns, make accurate predictions and support greater generalization. However, in materials science, collecting sufficient data can be costly and time-consuming. This can lead to models being trained on composition ranges that are too narrow, or on processing conditions that don’t have sufficient variability to identify and capture all of the important variables and features. CALPHAD helps overcome this limitation by generating extensive datasets from thermodynamic and kinetic simulations, filling in gaps where experimental data is scarce.

Example: Combined machine learning and CALPHAD approach for discovering processing-structure relationships in soft magnetic alloys

In a paper by Rajesh et al. (2018), TC-PRISMA simulations were performed for various compositions and processing conditions, specifically within the FINEMET alloy family based on the composition Fe72.89Si16.21B6.9Nb3Cu1. Compositions were extended by varying Fe and Si content within a range of ±3 atomic %, resulting in a three-dimensional parameter space defined by Fe and Si composition, annealing temperature (490°C to 550°C), and annealing time (up to 2 hours).

This approach generated a large, reliable dataset for the mean radius and volume fraction of Fe3Si nanocrystals, comprising 24,000 Thermo-Calc and TC-PRISMA simulations. This dataset was subsequently used to train surrogate machine learning models capable of accurately predicting nanocrystal properties from the input conditions.

Read the paper: Combined machine learning and CALPHAD approach for discovering processing-structure relationships in soft magnetic alloys

One point to note from this example is that the more features one wishes to generate data for, the more the dimensionality of the problem increases. It is therefore important to weigh the size of the composition space and other variables to determine the size of the training set needed.
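The growth in dataset size with dimensionality can be made concrete with a small full-factorial grid. The sketch below uses the variable ranges quoted in the example above, but the number of levels per variable, the lower bound on annealing time, and the treatment of Fe and Si as independent offsets are assumptions chosen for illustration and are not taken from the paper.

```python
from itertools import product


def linspace(start, stop, n):
    """Return n evenly spaced values from start to stop, inclusive."""
    if n == 1:
        return [start]
    step = (stop - start) / (n - 1)
    return [start + i * step for i in range(n)]


def grid_size(levels_per_variable):
    """Number of points in a full-factorial grid."""
    size = 1
    for n in levels_per_variable:
        size *= n
    return size


# Ranges from the example; level counts are assumed.
fe_offsets_at_pct = linspace(-3.0, 3.0, 7)
si_offsets_at_pct = linspace(-3.0, 3.0, 7)
temperatures_c = linspace(490.0, 550.0, 7)
times_h = linspace(0.5, 2.0, 4)  # lower bound of 0.5 h is an assumption

grid = list(product(fe_offsets_at_pct, si_offsets_at_pct,
                    temperatures_c, times_h))
print("grid points:", len(grid))

# With 10 levels per variable, each added variable multiplies the cost by 10.
for n_variables in range(1, 7):
    print(n_variables, "variables ->", grid_size([10] * n_variables), "points")
```

Because a full-factorial grid grows exponentially with the number of variables, sparse sampling schemes are often considered once more than a few variables are involved.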

Additionally, when generating large datasets using high-throughput approaches in conjunction with CALPHAD, it is important to consider the suitability of the data, just as with any CALPHAD prediction. For example, there are many cases in the literature where CALPHAD data is generated at room temperature to populate datasets for ML, even though these systems are unlikely ever to reach equilibrium at such low temperatures.

Feature Selection

Identify potential features of importance

Feature selection is crucial for the development of physics-informed models, where the selected features must represent meaningful physical properties so that predictions are accurate and interpretable and the models do not violate well-established physical principles. The challenge with materials datasets, however, is that they are sometimes too small to contain enough variation in the important features for those features to be identified, or the ML models end up being trained for a specific set of conditions and are hard to generalize. Additionally, the training data may be too far removed from the path-dependent variables that influenced the material’s state (for example, yield strength data that depends largely on how the material was processed, recorded without full knowledge of the processing metadata). CALPHAD simulations can help identify potential features of importance by providing key physical attributes related to composition, temperature, phase stability, phase fractions, and so on.

Example: Coupling physics in machine learning to predict properties of high-temperatures alloys

In a paper by Peng et al. (2020), Thermo-Calc calculations were made using the TCFE database for steels to enhance machine learning predictions of yield strength in high-temperature 9Cr steel alloys. The starting 9Cr dataset contained only raw experimental data such as elemental alloy compositions, processing, and testing conditions, so the training data lacked microstructural features that could further improve the accuracy of the model.

To address this, Thermo-Calc was used to calculate the volume fraction of M23C6 (where a higher volume fraction of M23C6 leads to higher yield strength) as well as the A3 temperature and martensite start temperature, which are directly correlated to the martensitic microstructure evolution. Additionally, by organizing the data into temperature-based sub-datasets, the correlations between features and yield strength were improved, enabling the ML models to predict properties across critical temperature ranges reliably.

It is important to stress that the simulations performed to support feature selection must be well-suited to the system being modeled. For example, generating phase stability and phase composition data for a system under equilibrium and comparing it with experimental data from a rapidly solidified and non-equilibrium structure will only confuse the ML model, not improve it.

Read the paper: Coupling physics in machine learning to predict properties of high-temperatures alloys
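The two steps in this example, adding calculated microstructural features to raw records and then grouping the records by test temperature, might look like the following sketch. The field names, bin edges, and every number are hypothetical; in practice the lookup table would be filled from Thermo-Calc calculations rather than typed in by hand.

```python
from collections import defaultdict


def add_calphad_features(record, calphad_lookup):
    """Return a copy of the record extended with precomputed features."""
    features = calphad_lookup[record["alloy_id"]]
    return {**record, **features}


def split_by_test_temperature(records, bin_edges_c):
    """Group records into sub-datasets using half-open bins [low, high)."""
    bins = defaultdict(list)
    for record in records:
        for low, high in zip(bin_edges_c[:-1], bin_edges_c[1:]):
            if low <= record["test_temp_c"] < high:
                bins[(low, high)].append(record)
                break
    return dict(bins)


# Made-up raw data: composition and processing columns are omitted for brevity.
raw_records = [
    {"alloy_id": "S1", "test_temp_c": 25.0, "yield_mpa": 600.0},
    {"alloy_id": "S1", "test_temp_c": 550.0, "yield_mpa": 400.0},
    {"alloy_id": "S2", "test_temp_c": 600.0, "yield_mpa": 350.0},
]

# Hypothetical stand-in for calculated features, keyed by alloy.
calphad_lookup = {
    "S1": {"m23c6_fraction": 0.020, "a3_temp_c": 850.0, "ms_temp_c": 380.0},
    "S2": {"m23c6_fraction": 0.015, "a3_temp_c": 870.0, "ms_temp_c": 400.0},
}

augmented = [add_calphad_features(r, calphad_lookup) for r in raw_records]
sub_datasets = split_by_test_temperature(augmented, [0.0, 300.0, 700.0])

for temperature_bin, rows in sub_datasets.items():
    print(temperature_bin, len(rows), "records")
```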

Interpretability

Translate complex model predictions into clear, physics-based insights

Interpretability is essential in materials science, as understanding the link between machine learning predictions and physical properties guides experimental design and optimization. Explainable AI techniques support this by translating complex model predictions into clear, physics-based insights.
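One widely used explainability technique is permutation importance: a feature's values are shuffled and the resulting increase in prediction error is taken as a measure of how much the model relies on that feature. The sketch below is a generic illustration with a toy model and made-up inputs. It is not the method of any paper cited here.

```python
import random


def mean_squared_error(model, rows, targets):
    """Mean squared error of model predictions against targets."""
    total = sum((model(row) - t) ** 2 for row, t in zip(rows, targets))
    return total / len(targets)


def permutation_importance(model, rows, targets, feature_index, seed=0):
    """Increase in error after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = mean_squared_error(model, rows, targets)
    column = [row[feature_index] for row in rows]
    rng.shuffle(column)
    permuted = [
        row[:feature_index] + [value] + row[feature_index + 1:]
        for row, value in zip(rows, column)
    ]
    return mean_squared_error(model, permuted, targets) - baseline


def toy_model(row):
    """Toy model that depends on feature 0 only (feature 1 is ignored)."""
    return 2.0 * row[0]


rows = [[float(i), float((i * 7) % 10)] for i in range(10)]
targets = [toy_model(row) for row in rows]

importance_0 = permutation_importance(toy_model, rows, targets, 0)
importance_1 = permutation_importance(toy_model, rows, targets, 1)
print("feature 0 importance:", importance_0)
print("feature 1 importance:", importance_1)
```

A feature the model ignores has zero importance by construction, which is why the second value is exactly zero here.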

Example: Multi-variate Process Models for Predicting Site-Specific Microstructure and Properties of Inconel 706 Forgings

A paper by Senanayake et al. (2023) exemplifies this principle by integrating CALPHAD data into a data analytics framework, linking physics-based simulations and experimental observations to create predictive Process–Structure, Process–Property, and Structure–Property models for forged Inconel 706.

This framework adopted a novel two-part method to approximate nano-scale precipitate distributions. Initially, a machine learning algorithm was used to process SEM images, supporting high-throughput measurements of precipitation distributions. These automated image analyses were then used to calibrate CALPHAD simulations for γ’ and γ” precipitate distributions via TC-PRISMA, using local time/temperature boundary conditions derived from only four samples. After calibrating the interfacial energy, Finite Element simulations extended the thermal profiles to cover 74 additional, unobserved local conditions. This approach effectively reduced the required physical observations of γ’ and γ” distributions from 78 to 4, thus making the process more efficient and feasible by combining available experimental and computational resources.

Read the paper: Multi-variate Process Models for Predicting Site-Specific Microstructure and Properties of Inconel 706 Forgings

Generalization

Enable models to learn a wider range of material behaviors

Generalization is important in materials science applications, where training datasets often cover limited composition or processing ranges, and yet models are required to predict properties across new materials and conditions. Without good generalization, models often perform poorly in broader applications, reducing their usefulness in research and industry. CALPHAD helps address this by generating data across diverse compositions and conditions, allowing models to learn a wider range of material behaviors.

Example: An explainable machine learning model for superalloys creep life prediction coupling with physical metallurgy models and CALPHAD

In a publication by Huang et al. (2023), an ML model for predicting the creep life of superalloys was developed. In this study, the initial dataset included 14 compositional and testing condition features, which the authors noted might contain redundant information, potentially complicating the model and increasing the risk of overfitting, resulting in reduced accuracy.

To address this, a dimensionality reduction approach was implemented, incorporating physical metallurgy models and CALPHAD to compute the volume and mole fraction of the γ′ phase as a function of composition and temperature using the TCNI database. This transformation reduced the feature set to eight key microstructural characteristics, thereby enhancing model accuracy and generalizability. To evaluate the model’s generalization capabilities, the authors then selected three types of engineering-grade Ni-based single crystal superalloys for creep testing under varied conditions. These compositions, intentionally excluded from the original dataset, provided a rigorous test of the model’s predictive accuracy on new alloy compositions using the combined physical metallurgy and CALPHAD-ML approach. The results indicated that the model achieved sufficient prediction accuracy and that its predictions could be reasonably explained.

Read the paper: An explainable machine learning model for superalloys creep life prediction coupling with physical metallurgy models and CALPHAD
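Testing on alloys that were deliberately excluded from training, as in this example, corresponds to a group-wise holdout split: all records of a held-out alloy go to the test set, so no composition appears on both sides. The sketch below is a generic illustration. The record fields, alloy names, and values are hypothetical and not taken from the paper.

```python
import random


def group_holdout_split(records, group_key, n_holdout_groups, seed=0):
    """Split records so that whole groups are held out for testing."""
    groups = sorted({record[group_key] for record in records})
    if n_holdout_groups >= len(groups):
        raise ValueError("need at least one group left for training")
    rng = random.Random(seed)
    holdout = set(rng.sample(groups, n_holdout_groups))
    train = [r for r in records if r[group_key] not in holdout]
    test = [r for r in records if r[group_key] in holdout]
    return train, test, holdout


# Made-up records: five alloys, each tested at two conditions.
records = [
    {"alloy": f"alloy_{i}", "stress_mpa": stress, "creep_life_h": 100.0 * i}
    for i in range(1, 6)
    for stress in (150.0, 300.0)
]

train, test, holdout = group_holdout_split(records, "alloy", n_holdout_groups=2)
print("held-out alloys:", sorted(holdout))
print("train size:", len(train), "test size:", len(test))
```

A purely random row-wise split would place records of the same alloy in both sets and would tend to overstate how well the model handles new compositions.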

Development of Surrogate Models

Explore vast compositional and processing spaces quickly

Developing surrogate models using machine learning offers significant benefits to materials science by providing fast, approximate predictions of complex properties that would otherwise require time-intensive simulations or experiments. These models, trained on data from high-fidelity simulations (like CALPHAD or Finite Element Analysis) or experimental measurements, allow researchers to explore vast compositional and processing spaces quickly.

There is a trade-off, however, in that important features can be lost if the training sets are not sufficiently large. Additionally, many papers, such as Chang et al. (2019), note that a proper network architecture and hyperparameter settings are crucial for a neural network model to achieve decent accuracy and generalization at the same time. Better fitted models can be obtained at the expense of poor generalization, or alternatively, more generalized models can be obtained at the expense of predictive accuracy.

In the context of CALPHAD simulations, surrogate models may be generated that can be used to explore composition space rapidly (such as for HEAs) but the expectation is that they may only indicate in broad terms phases that might be present, for example, rather than quantitative amounts of phases. Even very high fidelity ML models trained on millions of CALPHAD simulations would still not reproduce the accuracy of a single CALPHAD calculation for a specific set of conditions.

Examples of Surrogate Models

One example of a surrogate model was already referred to under Data Quantity, Rajesh et al. (2018), where an ML model was developed for a very narrow composition and processing range of a FINEMET alloy. Another example was given under Data Quality / Data Cleaning, Paulus (2020), where Thermo-Calc was used to generate a comprehensive dataset to train an artificial neural network (ANN) model to predict sigma phase formation in nickel-based superalloys. Using the TC-Python API, equilibrium calculations were made across a set of 4,000 alloys containing ten alloying elements, using the TCNI9 thermodynamic database. To assess the benefit of a larger, more computationally expensive dataset, this was then scaled up to 40,000 data points, giving an RMSE of 0.016 compared with 0.026 for the smaller dataset. In this reference, the author raised the point that while the metamodel can act as an additional filter ahead of the computationally heavy simulations, it is for the researcher to decide whether it is acceptable to design new alloys with an error of 2.1%, given that materials design is a highly selective process.
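The RMSE figures quoted above compare surrogate predictions with reference calculations: the root mean squared error is the square root of the average squared difference between the two. The sketch below illustrates how such a comparison could be set up and how the error typically falls as the training table grows. The reference function is a toy stand-in, not a CALPHAD result, and the surrogate is simple linear interpolation rather than a neural network.

```python
import math
from bisect import bisect_right


def rmse(predicted, reference):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(predicted))


def reference_model(x):
    """Toy stand-in for an expensive calculation."""
    return x * x


def build_table(n_points):
    """Sample the reference model on an even grid over [0, 1]."""
    xs = [i / (n_points - 1) for i in range(n_points)]
    return xs, [reference_model(x) for x in xs]


def surrogate(x, table):
    """Piecewise-linear interpolation in the precomputed table."""
    xs, ys = table
    i = min(max(bisect_right(xs, x) - 1, 0), len(xs) - 2)
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])


test_points = [(i + 0.5) / 200 for i in range(200)]
truth = [reference_model(x) for x in test_points]

small_table = build_table(5)
large_table = build_table(50)
rmse_small = rmse([surrogate(x, small_table) for x in test_points], truth)
rmse_large = rmse([surrogate(x, large_table) for x in test_points], truth)
print("RMSE with 5 training points: ", rmse_small)
print("RMSE with 50 training points:", rmse_large)
```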

Sources of Data for Machine Learning

Predict a wide range of composition and temperature dependent materials property data

Common sources of data include handbooks, but these are usually limited to only the most common alloys, are rarely temperature dependent, and do not capture composition variation or contain data for novel/experimental materials. Thermo-Calc can be used to predict a wide range of composition and temperature dependent materials property data.


Example: Supplementing Finite Element Modelling with Calculated Thermophysical Properties

Read an example of how CALPHAD-generated data can be used to supply material property data for Finite Element models in our blog post on the subject.

Read the blog post

You can also read a list of the properties that can be predicted using Thermo-Calc.

Use of TC-Python in Workflows for Machine Learning

Automate data processing workflows and integrate predictive capabilities into broader ML frameworks

Using Python™ as a ‘binder’ language coupled with the TC-Python API, Thermo-Calc offers a fully customizable foundation for developing workflows that automate data processing (such as cleansing or generating data) and integrate predictive capabilities into broader ML frameworks.

Here are some ways TC-Python and ML can work together:

Automating High-Throughput Calculations

TC-Python enables automated, high-throughput calculations across many compositions and conditions. This automation is essential for generating diverse datasets and covering broad compositional and processing spaces. For instance, scripting with TC-Python allows researchers to simulate phase stability efficiently, saving data automatically for ML training.
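The overall shape of such a high-throughput script is sketched below. To keep the example runnable without a license, the function `calculate_phase_fractions` is a hypothetical placeholder that returns arbitrary, non-physical numbers. In a real workflow it would be replaced by the appropriate TC-Python calculation, which is not reproduced here. The column names and input values are likewise assumptions.

```python
import csv
import io
from itertools import product


def calculate_phase_fractions(cr_wt_pct, temperature_k):
    """Hypothetical placeholder for a thermodynamic calculation.

    Returns arbitrary, non-physical values so the loop can be demonstrated.
    """
    if temperature_k <= 0:
        raise ValueError("temperature must be positive")
    fcc = min(1.0, max(0.0, (temperature_k - 1000.0) / 400.0))
    return {"fcc": fcc, "bcc": 1.0 - fcc}


def run_batch(compositions, temperatures_k, calculate, out_stream):
    """Run every combination, write successes as CSV, collect failures."""
    fieldnames = ["cr_wt_pct", "temperature_k", "fcc", "bcc"]
    writer = csv.DictWriter(out_stream, fieldnames=fieldnames)
    writer.writeheader()
    n_ok, failures = 0, []
    for cr, temperature in product(compositions, temperatures_k):
        try:
            result = calculate(cr, temperature)
        except ValueError as exc:
            # Record the failed condition and continue with the batch.
            failures.append((cr, temperature, str(exc)))
            continue
        writer.writerow({"cr_wt_pct": cr, "temperature_k": temperature,
                         **result})
        n_ok += 1
    return n_ok, failures


buffer = io.StringIO()  # a file opened with newline="" would work the same way
n_ok, failures = run_batch([8.0, 9.0, 10.0], [-1.0, 1100.0, 1300.0],
                           calculate_phase_fractions, buffer)
print("successful:", n_ok, "failed:", len(failures))
print(buffer.getvalue())
```

Logging failed conditions instead of stopping the run matters for large batches, where a single calculation that fails to converge should not discard hours of completed work.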

Formatting Data

Typical formats for storing training data for ML include structured formats like CSV, Excel spreadsheets, and HDF5 (Hierarchical Data Format). For larger datasets, particularly when handling multidimensional arrays or large-scale simulations, Parquet or SQL databases may also be used. These formats are chosen for their compatibility with ML libraries such as TensorFlow or PyTorch and ease of access in data processing tools like Pandas.

TC-Python can export results directly into CSV or Excel for straightforward integration with ML algorithms. For complex datasets or larger-scale projects, TC-Python can store data in HDF5 or SQL databases, preserving data structure and facilitating fast read/write speeds. Additionally, TC-Python’s integration with Python libraries like Pandas allows for direct manipulation of data, enabling preprocessing (e.g., feature extraction, scaling) before final storage. This versatility makes TC-Python a powerful tool for generating, structuring, and exporting high-quality datasets tailored for ML applications in materials science.
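As a rough illustration of the storage and preprocessing steps mentioned above, the sketch below writes a few result rows to an SQL database and applies min-max scaling to one column. It uses only the Python standard library (`sqlite3`) and made-up values. The table layout is an assumption, and no TC-Python functions are shown.

```python
import sqlite3


def min_max_scale(values):
    """Scale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Made-up result rows: (Cr content in wt.%, temperature in K, phase fraction).
rows = [
    (9.0, 1100.0, 0.25),
    (9.0, 1200.0, 0.50),
    (9.0, 1300.0, 0.75),
]

conn = sqlite3.connect(":memory:")  # use a file path for persistent storage
conn.execute(
    "CREATE TABLE results ("
    "cr_wt_pct REAL, temperature_k REAL, fcc_fraction REAL)"
)
conn.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)
conn.commit()

# Read one column back in a defined order and scale it for ML training.
temperatures = [
    row[0]
    for row in conn.execute(
        "SELECT temperature_k FROM results ORDER BY temperature_k"
    )
]
scaled_temperatures = min_max_scale(temperatures)
row_count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
conn.close()

print("rows stored:", row_count)
print("scaled temperatures:", scaled_temperatures)
```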

Integrating into ICME Frameworks

In ICME frameworks, TC-Python can also be used to link Thermo-Calc with ML and other simulation tools, enabling end-to-end simulations from processing to properties and improving understanding of material behavior.

Storage of Datasets and End User License Agreement

The use of Thermo-Calc Software products is governed by the terms of our End User License Agreement. Specifically, Section 7(f) refers to some restrictions which should be considered when using the software in conjunction with ML.

You are in no event allowed to, without the prior written permission of TCSAB:

Make available to third parties, or publish or store in a public data repository, systematic mappings of results generated by the Program. This means that when publishing papers related to the development of ML models, any substantial amounts of training data generated by Thermo-Calc Software products should not be provided in structured formats like those described above, without the prior permission of TCSAB. Additionally, if the ML model is being developed for, or is to be made available to third parties, then please contact TCSAB to discuss the terms under which such rights can be granted.
Generate and disseminate look-up tables, surrogate models, neural networks or similar based on the results generated by the Program and which could circumvent the need of the Program to obtain the result(s) or compete with the Program, directly or indirectly. This means that if an ML-trained surrogate model is developed purely for the purpose of circumventing the need to use the Program in the future, then this is prohibited under the terms of use, in the same way as creating a derivative work. However, developing surrogate models within a narrow scope, such as those described in the examples above, is allowed provided they are for internal use only.

If you have questions around the interpretation of the EULA in regard to your own work, please contact TCSAB.