ML Libraries

Machine learning and the algorithms that power it have been developed over several decades. Some implementations of these algorithms are proprietary, and some are open source.

Among the open-source libraries, the following are prominent:

scikit-learn ("sci" is pronounced "sy"; the name derives from "SciPy Toolkit"): This is by far the most popular library for traditional machine learning in Python. It provides a vast collection of algorithms and tools for classification, regression, clustering, and preprocessing, and it is maintained by a large community of open-source contributors.
WEKA: Developed by the University of Waikato, this is a popular open-source library originally used by Java developers. However, Python, C#, and Groovy developers can also use it now.
TensorFlow: An open-source, full ecosystem for model development and deployment. Famous for its deep learning algorithms, it was originally developed and released by Google.
Keras: A high-level API that simplifies deep learning model development. It can run on top of several backends, including TensorFlow (Keras 3 also supports JAX and PyTorch). Keras is also available inside TensorFlow as tf.keras, which is the recommended way to build models with TensorFlow.
PyTorch: An open-source library originally developed by Facebook (now Meta), famous for its deep learning algorithms. It is a major competitor to TensorFlow, often praised for its Pythonic feel and flexibility in research environments.
PyCaret: An open-source, low-code wrapper module for several machine learning libraries and frameworks, such as scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, Ray, and many more.
TensorFlow Hub: This is a repository of pre-trained models that data scientists have made available to the public. Using a pre-trained model can save significant time and computational resources. You could also use models that are bundled with Keras by importing tf.keras.applications.

In this eBook, we are predominantly using scikit-learn algorithms.

scikit-learn (module name 'sklearn')

This library is preinstalled in Colab and Anaconda, so you don't need to install it separately in either environment. The usual practice is to import the specific classes or functions you need from its submodules (for example, sklearn.linear_model) rather than the whole package.
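
As a minimal sketch of what such imports typically look like (the particular classes chosen here, LinearRegression and OneHotEncoder, are only illustrative picks and not prescribed by this eBook):

```python
# Import the top-level package, mainly useful for checking the installed version.
import sklearn

# Import only the specific classes needed from their submodules.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

print(sklearn.__version__)

# The imported classes can now be instantiated directly.
model = LinearRegression()
encoder = OneHotEncoder()
```

Importing names from the submodule keeps the code short and makes it clear which parts of the library a notebook actually depends on.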

Estimator Object

In scikit-learn, an estimator is the base object for all algorithms. It's an object that learns from data. It can be a classification, regression, or clustering algorithm, or a transformer that extracts or filters useful features from raw data. So, everything starts with constructing an estimator in scikit-learn.

fit method

All estimator objects have a fit(X, y) method. This method is used to train the model.

X: A 2D array of shape (n_samples, n_features) representing the features of the training data.
y: A 1D array representing the target variable (the outcome). y is optional for unsupervised learning estimators.

The fit method trains the estimator in place and returns the estimator itself, so calls can be chained. Most estimators expect the data in X to be numerical, so categorical features must be encoded first. Classifiers generally accept string class labels in y.
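
The following sketch illustrates fit on a supervised and an unsupervised estimator. The tiny dataset and the choice of LinearRegression and KMeans are assumptions made only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# X is 2D: 4 samples, 1 feature. y is 1D: one target per sample (y = 2x + 1).
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Supervised estimator: fit takes both X and y.
model = LinearRegression()
fitted = model.fit(X, y)

# fit trains in place and returns the same estimator object.
same_object = fitted is model

# The trained estimator can now make predictions on new 2D input.
prediction = model.predict(np.array([[5.0]]))

# Unsupervised estimator: fit takes only X, no y is needed.
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0)
clusterer.fit(X)
cluster_labels = clusterer.labels_  # one cluster index per sample
```

Because fit returns the estimator, a one-liner such as LinearRegression().fit(X, y).predict(X) also works.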

Estimator parameters (hyper-parameter settings)

You can instantiate an estimator with no arguments, in which case it uses its default settings, or you can pass settings as keyword arguments (name=value) to the constructor. These settings are the estimator's hyperparameters: they are chosen by you before training rather than learned from the data.
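
A short sketch of both styles of instantiation, using DecisionTreeClassifier as an arbitrary example estimator (the values max_depth=3 and random_state=42 are illustrative assumptions, not recommendations):

```python
from sklearn.tree import DecisionTreeClassifier

# No arguments: every hyperparameter takes its default value.
default_tree = DecisionTreeClassifier()

# Keyword arguments override selected hyperparameters.
tuned_tree = DecisionTreeClassifier(max_depth=3, random_state=42)

# get_params() returns the current hyperparameter settings as a dict.
default_params = default_tree.get_params()
tuned_params = tuned_tree.get_params()
print(default_params["max_depth"])
print(tuned_params["max_depth"])

# set_params() changes hyperparameters on an existing estimator.
tuned_tree.set_params(max_depth=5)
```

Calling get_params() is a quick way to see which hyperparameters an estimator exposes and what their defaults are.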

Estimated Parameters

When an estimator is trained using the fit method, it learns model parameters from the data. These learned parameters can be accessed as attributes on the estimator object. By convention, all learned attributes in scikit-learn have a trailing underscore (_). For example, in a linear regression, you can get the model coefficients using coef_ and the intercept using intercept_. The set of learned attributes differs from one estimator to another, depending on the type of algorithm you picked.
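
As a small illustration of learned attributes, the sketch below fits a linear regression to made-up data generated from y = 2x + 1 and reads the parameters back. The data is an assumption chosen so the result is easy to verify by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data that follows y = 2x + 1 exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression()

# Learned attributes do not exist until fit has been called.
has_coef_before = hasattr(model, "coef_")

model.fit(X, y)

# After fitting, the trailing-underscore attributes hold the learned values.
slope = model.coef_[0]        # approximately 2.0
intercept = model.intercept_  # approximately 1.0
print(slope, intercept)
```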

sklearn.preprocessing

This module provides many classes and functions that are typically used for preprocessing a dataset, such as LabelEncoder and OneHotEncoder for encoding categorical values.
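
For instance, a label encoder can be sketched as follows. The animal names are made-up sample data. Note that LabelEncoder is intended for encoding target labels (y) rather than input features.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up categorical target labels.
labels = ["cat", "dog", "bird", "dog"]

encoder = LabelEncoder()
encoder.fit(labels)

# classes_ holds the distinct labels in sorted order.
classes = list(encoder.classes_)  # ['bird', 'cat', 'dog']

# transform maps each label to its index in classes_.
encoded = encoder.transform(labels)  # [1, 2, 0, 2]

# inverse_transform maps the integers back to the original labels.
decoded = list(encoder.inverse_transform(encoded))
```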

fit-transform method

For transformer estimators, there is a convenient fit_transform(X) method. This single method first learns the parameters from the data (fit) and then applies the transformation (transform). For example, calling fit_transform on a one-hot encoding object will learn the categories and return the encoded data in one step.
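
The one-hot encoding case described above might look like this minimal sketch. The colour values are made-up sample data, and the encoder is used with its default settings.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical feature, given as a 2D array (4 samples, 1 feature).
X = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder()

# fit_transform learns the categories and encodes X in a single call.
# By default the result is a sparse matrix, so convert it for display.
encoded = encoder.fit_transform(X)
dense = encoded.toarray()

# The learned categories (sorted) define the column order of the output.
categories = list(encoder.categories_[0])  # ['blue', 'green', 'red']

# Calling fit and then transform separately gives the same result.
two_step = OneHotEncoder().fit(X).transform(X).toarray()
same_result = bool((dense == two_step).all())
```

When encoding new data later (for example, a test set), call transform on the already fitted encoder rather than fit_transform, so that the categories learned from the training data are reused.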