Credit Score Model#
Credit Scoring
Tool for classifying customers to reduce current and expected credit risk.
Defined as the process of modeling creditworthiness (Hand and Jacka, 1998).
Involves transforming relevant data into numerical measures for guiding credit decisions (Anderson, 2007).
A credit scoring model estimates the probability of default, indicating the likelihood of a credit event like bankruptcy or failure to pay.
The output of such a model is typically a credit score; a higher score indicates a lower risk of default.
Credit factors vary by loan type: for credit card loans, factors might include payment history and credit utilization, while for mortgages they could include down payment and job history.
The accuracy of these models is crucial for maximizing financial institutions’ risk-adjusted returns.
Economic fluctuations like recessions or expansions necessitate that models be adaptable and quickly adjustable by risk managers and credit analysts.
Common Techniques in Credit Scoring Model Development and Validation#
Logistic regression and linear regression
Machine learning and predictive analytics
Gini Coefficients
Binning algorithms (e.g., monotone, equal frequency, and equal width)
Cumulative Accuracy Profile (CAP)
Receiver operating characteristic (ROC)
Kolmogorov-Smirnov (K-S) statistic
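As a rough illustration of how several of these validation measures relate, here is a minimal Python sketch (scikit-learn on synthetic data; `y_true` and `y_score` are placeholder names) that computes the ROC AUC, derives the Gini coefficient as 2·AUC − 1, and takes the K-S statistic as the maximum gap between the ROC curve's true-positive and false-positive rates.

```python
# Minimal sketch: discrimination metrics on synthetic data (illustrative only).
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                           # 1 = default, 0 = non-default
y_score = np.clip(0.3 * y_true + 0.7 * rng.random(1000), 0, 1)   # toy predicted default probabilities

auc = roc_auc_score(y_true, y_score)      # area under the ROC curve
gini = 2 * auc - 1                        # Gini coefficient derived from AUC
fpr, tpr, _ = roc_curve(y_true, y_score)
ks = float(np.max(tpr - fpr))             # Kolmogorov-Smirnov statistic

print(f"AUC={auc:.3f}  Gini={gini:.3f}  KS={ks:.3f}")
```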
Credit Score Model Types#
Traditional Statistical Models
Logistic Regression: Still widely used for its simplicity and interpretability.
Decision Trees: Used for their ability to handle non-linear relationships and interactions between variables.
Machine Learning Models
Random Forests: An ensemble method that uses multiple decision trees to improve predictive accuracy and control over-fitting.
Gradient Boosting Machines (GBM): Such as XGBoost and LightGBM, which build models in a stage-wise fashion and are highly effective for classification tasks like credit scoring.
Support Vector Machines (SVM): Effective in high-dimensional spaces and used for both regression and classification.
Reinforcement Learning
Graph-based models
Deep Learning Models
Neural Networks: Including feedforward neural networks and more complex architectures like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). These models can capture complex patterns in large datasets.
Autoencoders: Used for anomaly detection and feature extraction in credit scoring.
Hybrid Models
Ensemble Methods: Combining multiple models (e.g., blending logistic regression with gradient boosting) to improve robustness and predictive performance.
Homogeneous ensemble classifiers: (1) independent base models (e.g., bagging algorithms), (2) dependent base models (e.g., boosting algorithms).
Heterogeneous ensemble classifiers: combine different classification algorithms (e.g., logistic regression, random forests). If some base models are pruned beforehand, the result is called a selective ensemble (either static or dynamic).
Stacking: A technique where predictions from multiple models are used as inputs to a higher-level model.
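A minimal stacking sketch, assuming a numeric feature matrix and binary default labels (synthetic data here): logistic regression and gradient boosting act as base learners, and a logistic-regression meta-learner combines their out-of-fold predictions.

```python
# Minimal stacking sketch: two base learners feeding a meta-learner (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("gbm", GradientBoostingClassifier()),
    ],
    final_estimator=LogisticRegression(),  # meta-learner on base-model predictions
    cv=5,                                  # out-of-fold predictions feed the meta-learner
)
stack.fit(X_train, y_train)
print("held-out accuracy:", stack.score(X_test, y_test))
```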
Alternative Data and Big Data Techniques
Use of Alternative Data: Incorporating non-traditional data sources such as social media activity, utility payments, and other digital footprints to enhance credit scoring models.
Big Data Analytics: Leveraging large and diverse datasets to improve model accuracy and insights.
Explainable AI (XAI) Models
SHAP (SHapley Additive exPlanations): Used to interpret complex models by assigning importance values to each feature.
LIME (Local Interpretable Model-agnostic Explanations): Provides explanations for individual predictions made by black-box models.
Regulatory and Ethical Considerations
Fairness and Bias Mitigation: Incorporating techniques to ensure models are fair and do not discriminate against protected groups.
Transparency: Ensuring that models can be explained and understood by stakeholders, including regulatory bodies.
Examples of Implementations#
FICO Score 10: Incorporates trended data to provide a more comprehensive view of an individual’s credit behavior over time.
VantageScore 4.0: Uses machine learning techniques and includes data on credit usage patterns, payment history, and total debt.
Augmenting Hybrid Credit Score Models with Alternative Datasets#
Steps to Approach this:
Data Integration
Identify Relevant Features: Determine which aspects of social media activity are relevant to credit scoring (e.g., frequency of posts, sentiment analysis, network size, engagement metrics).
Combine Datasets: Merge traditional credit data with social media data for individuals who have a social media presence. Ensure that data from different sources are aligned properly.
Handling Missing Data
Indicator Variables: Create binary indicator variables to mark the presence or absence of social media data for each individual.
Separate Models: Train separate models for individuals with and without social media data. Combine the predictions using a meta-model.
Imputation: Use imputation techniques to handle missing social media data, though this should be done cautiously to avoid introducing bias.
Feature Engineering
Extract Features: Use natural language processing (NLP) and other techniques to extract features from social media text (e.g., sentiment scores, topic modeling).
Engagement Metrics: Include metrics like the number of friends/followers, frequency of posts, and interaction rates.
Modeling Approach
Hybrid Model Structure: Use a hybrid model structure where social media features are added as additional inputs for the machine learning model. This could be an ensemble model where different data sources contribute to the final prediction.
Stacking and Blending: Employ stacking or blending techniques where base models (one using traditional data and one using augmented data) are combined by a meta-learner.
Training and Validation
Separate Training: Train the model on individuals with complete traditional and social media data. Validate on a subset to ensure robustness.
Cross-validation: Use cross-validation to test the performance of the model and prevent overfitting.
Fairness Checks: Ensure that the inclusion of social media data does not introduce bias or unfair discrimination.
Model Interpretation and Explainability
Explainable AI Tools: Use tools like SHAP or LIME to interpret the impact of social media features on the model’s predictions.
Transparency: Maintain transparency about how social media data is used and ensure compliance with privacy regulations.
Ethical and Privacy Considerations
Consent and Privacy: Ensure that individuals consent to the use of their social media data and that privacy regulations (e.g., GDPR) are strictly followed.
Ethical Use: Be transparent about the use of social media data and ensure it is used ethically, without leading to discriminatory practices.
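A rough sketch of the integration, missing-data, and hybrid-modeling steps above. The frames `credit` and `social`, the key `customer_id`, and the label `default` are hypothetical placeholders, and the toy values are for illustration only.

```python
# Rough sketch: merge traditional and social-media features, flag availability,
# and fit a single hybrid model (all names and values are illustrative).
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

credit = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "utilization": [0.2, 0.9, 0.5, 0.7],
    "late_payments": [0, 3, 1, 2],
    "default": [0, 1, 0, 1],
})
social = pd.DataFrame({
    "customer_id": [1, 3],             # only some customers have a social-media presence
    "sentiment_score": [0.6, -0.1],
    "post_frequency": [12, 4],
})

# 1. Data integration: left-join keeps customers without social-media data.
data = credit.merge(social, on="customer_id", how="left")

# 2. Missing-data handling: availability indicator plus a neutral fill value.
social_cols = ["sentiment_score", "post_frequency"]
data["has_social"] = data[social_cols].notna().any(axis=1).astype(int)
data[social_cols] = data[social_cols].fillna(0.0)

# 3. Hybrid model: traditional + social features + availability indicator.
X = data.drop(columns=["customer_id", "default"])
y = data["default"]
model = GradientBoostingClassifier().fit(X, y)
```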
Example Workflow#
Data Collection:
Collect traditional credit data (e.g., credit history, loan repayment).
Collect social media data (e.g., public posts, engagement metrics) for consenting individuals.
Feature Engineering:
Extract features from both datasets.
Create indicator variables for the presence of social media data.
Model Development:
Develop base models for traditional data and augmented data separately.
Combine these models using an ensemble or stacking approach.
Model Training:
Train the hybrid model using a combined dataset.
Validate using cross-validation techniques.
Model Interpretation:
Use SHAP/LIME to understand the contribution of social media features.
Implementation:
Deploy the model, ensuring it meets regulatory and ethical standards.
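A minimal sketch of the interpretation step, assuming a fitted tree-based model and the separately installed `shap` package; the synthetic data stands in for the combined traditional/social feature matrix.

```python
# Minimal SHAP sketch for a tree-based model (illustrative only).
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = GradientBoostingClassifier().fit(X, y)

explainer = shap.TreeExplainer(model)     # SHAP values for tree ensembles
shap_values = explainer.shap_values(X)    # per-feature contribution for each applicant
mean_impact = np.abs(shap_values).mean(axis=0)
print("mean |SHAP| per feature:", np.round(mean_impact, 3))
```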
According to West [Wes00]:
Neural Network Models:
Multilayer Perceptron
Mixture-of-Experts (recommended)
Radial Basis Function (recommended)
Learning Vector Quantization
Fuzzy Adaptive Resonance
Traditional Methods:
Linear Discriminant Analysis
Logistic Regression (best among traditional methods)
k-Nearest Neighbor
Kernel Density Estimation
Decision Trees
Chuang and Lin [CL09] introduce a two-stage Reassigning Credit Scoring Model (RCSM) to improve accuracy and reduce Type I errors.
The first stage involves constructing an ANN-based model to classify credit applicants as either accepted (good) or rejected (bad).
The second stage reduces Type I errors by reassigning mistakenly rejected good applicants to a “conditionally accepted” category using a CBR-based classification technique.
The RCSM was tested on a credit card dataset from the UCI repository and demonstrated greater accuracy compared to four other commonly used approaches.
Linear Discriminant Analysis
Logistic Regression
CART (Classification and regression tree)
MARS (Multivariate adaptive regression spline)
ANNs (Artificial neural networks)
CBR (Case-based reasoning)
Credit scorecard models are easier to interpret and deploy (Yap et al. [YOH11]).
Hlongwane et al. [HRM24] implement:
XGBoost (preferred by [GVBB+21])
LightGBM
CatBoost
Model-X knockoffs
Feature Selection#
Information gain, gain ratio, and chi-square: Trivedi et al. (2020) [Tri20]
Neighbourhood rough set (NRS): Tripathi and Aggarwal (2018) [TEC18]
Other methods: Nalic et al. (2020) [NalicMartinovicvZagar20]
Credit Score for Business#
SMEs: Roy et al. (2023) [RS23]
Alternative Data Sources#
Email usage and psychometric variables: Djeundje et al. (2021) [DCCH21], Arraiz et al. (2017) [ArraizBS17]
Social Media data: Wei et al. (2016) [WYVdBD16], Ge et al. (2017) [GFGZ17], de Souza et al. (2019) [DCMS+19]
Telecommunication: Oskarsdottir et al. (2019) [OskarsdottirBS+19], Ots and Li (2020) [OLT20], de Montjoye et al. (2011) [dOKCC+11], Pedro et al. (2015) [PPO15], Agarwal et al. (2018) [ALCS18]
Approaches to Handle Imbalanced Data#
1. Data-Level Approaches#
a. Oversampling Techniques#
SMOTE (Synthetic Minority Over-sampling Technique): Generates synthetic examples of the minority class by interpolating between existing samples.
ADASYN (Adaptive Synthetic Sampling): Similar to SMOTE but focuses more on generating synthetic samples for harder-to-classify instances.
Random Oversampling: Replicates minority class examples until the dataset is balanced.
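A minimal SMOTE sketch using the imbalanced-learn package, with a synthetic dataset standing in for an imbalanced credit portfolio.

```python
# Minimal SMOTE sketch: rebalance a synthetic imbalanced dataset (illustrative only).
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))   # minority class oversampled to parity
```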
b. Undersampling Techniques#
Random Undersampling: Randomly removes instances from the majority class to balance the dataset.
Cluster-Based Undersampling: Clusters the majority class and removes samples based on cluster proximity.
NearMiss: Selects majority samples closest to the minority samples or farthest from other majority samples.
c. Hybrid Techniques#
SMOTE-Tomek Links: Combines SMOTE with Tomek links to remove overlapping samples from the majority class.
SMOTE-ENN (Edited Nearest Neighbors): Uses SMOTE for oversampling and ENN for cleaning the dataset by removing misclassified instances.
2. Algorithm-Level Approaches#
a. Cost-Sensitive Learning#
Cost-Sensitive Classifiers: Adjusts the learning process to minimize the cost of misclassifications, such as higher penalties for minority class errors.
Weighted Loss Functions: Assigns different weights to classes in the loss function to emphasize the minority class.
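A minimal cost-sensitive sketch with scikit-learn class weights; the 1:10 weighting is an illustrative assumption, not a recommended value.

```python
# Minimal cost-sensitive sketch: penalize minority-class errors more heavily.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

# Illustrative 1:10 weighting; class_weight="balanced" would instead weight
# classes inversely to their frequencies.
clf = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000).fit(X, y)
```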
b. Ensemble Methods#
Balanced Random Forest: Modifies the random forest algorithm to balance each bootstrap sample.
EasyEnsemble: Combines multiple weak learners trained on different balanced subsets of the majority class.
RUSBoost (Random UnderSampling with Boosting): Integrates undersampling with boosting techniques.
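A minimal sketch of a balanced ensemble from imbalanced-learn, where each tree is grown on a bootstrap sample rebalanced by undersampling the majority class.

```python
# Minimal balanced-ensemble sketch (illustrative only; requires imbalanced-learn).
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
clf = BalancedRandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```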
c. Anomaly Detection#
One-Class SVM: Treats the minority class as the target and identifies it against the majority background.
Isolation Forest: Detects outliers, assuming the minority class can be seen as anomalies.
3. Deep Learning Approaches#
a. Data Augmentation#
GANs (Generative Adversarial Networks): Generate realistic minority class samples to augment the dataset.
Autoencoders: Learn latent features of the minority class and use them to generate new samples.
b. Transfer Learning#
Feature Transfer: Uses features learned from a balanced or related task to improve minority class recognition.
Fine-Tuning: Fine-tunes pre-trained models on the imbalanced dataset to leverage general features.
c. Specialized Architectures#
Focal Loss: Modifies the cross-entropy loss to focus more on hard-to-classify examples, often used in object detection tasks.
Class-Balanced Loss: Scales the loss by the inverse of the class frequency to balance the influence of each class.
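A NumPy sketch of binary focal loss, \(FL = -\alpha (1 - p_t)^{\gamma} \log(p_t)\), where \(p_t\) is the predicted probability of the true class; the \(\alpha\) and \(\gamma\) values are common defaults rather than credit-scoring-specific choices.

```python
# Minimal focal-loss sketch in NumPy (illustrative defaults for alpha and gamma).
import numpy as np

def focal_loss(y_true, p_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Mean binary focal loss; y_true in {0, 1}, p_pred = P(y = 1)."""
    p_pred = np.clip(p_pred, eps, 1 - eps)
    p_t = np.where(y_true == 1, p_pred, 1 - p_pred)     # probability of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))

print(focal_loss(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.4])))
```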
Cross-Validation#
Cross-validation is a statistical method used to estimate the skill of machine learning models. It is primarily used in applied machine learning to estimate the predictive power of a model on new data. Cross-validation involves partitioning a dataset into a training set and a test set, training the model on the training set, and evaluating it on the test set. This process is repeated multiple times with different splits to reduce variability and obtain a more accurate measure of model performance.
1. Nested Cross-Validation#
Nested cross-validation is used for model selection and hyperparameter tuning while avoiding overfitting.
Inner Loop: Performs cross-validation for hyperparameter tuning.
Outer Loop: Evaluates the model performance using the optimal hyperparameters found in the inner loop.
Provides an unbiased estimate of the model’s performance.
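A minimal nested cross-validation sketch with scikit-learn: `GridSearchCV` acts as the inner loop for hyperparameter tuning, and `cross_val_score` provides the outer evaluation loop (synthetic data, illustrative grid).

```python
# Minimal nested CV sketch: inner loop tunes, outer loop evaluates (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, None]},
    scoring="roc_auc",
    cv=inner,                                   # inner loop: hyperparameter tuning
)
scores = cross_val_score(search, X, y, cv=outer, scoring="roc_auc")  # outer loop: evaluation
print("nested CV AUC:", scores.mean())
```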
2. Monte Carlo Cross-Validation (Repeated Random Subsampling Validation)#
Randomly splits the dataset into training and test sets multiple times (more than two).
Averages the performance metrics across different splits.
Offers a better approximation of model performance by considering varied data splits.
3. Stratified K-Fold Cross-Validation#
Ensures each fold has a representative proportion of each class.
Important for imbalanced datasets.
Reduces biased performance estimates due to uneven class distribution.
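A minimal stratified K-fold sketch showing that each test fold preserves the (synthetic) default rate.

```python
# Minimal stratified K-fold sketch: class proportions are preserved per fold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    print("test-fold default rate:", np.mean(y[test_idx]))   # ~0.1 in every fold
```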
4. Leave-One-Out Cross-Validation (LOOCV)#
Uses each sample once as a test set, with the remaining samples as the training set.
Provides a nearly unbiased performance estimate but is computationally intensive.
Suitable for small datasets.
5. Group K-Fold Cross-Validation#
Ensures that samples from the same group (e.g., same subject or time period) are not split across different folds.
Useful for datasets where samples are not independent and identically distributed (i.i.d.).
6. Time Series Cross-Validation (Rolling Forecasting Origin)#
Designed for time series data where the order of observations is crucial.
Uses a rolling window approach to train on past data and test on future data.
Preserves temporal order, making it suitable for time-dependent datasets.
To partition data for model comparison in the credit scoring model:
Stratified K-Fold Cross-Validation:
Purpose: To create training and test sets for all models.
Process: Partition the data into K folds, ensuring each fold has a representative proportion of each class.
Nested Cross-Validation:
Applied to Each Training Set: Use the training sets obtained from stratified K-fold.
Inner Loop:
Purpose: Hyperparameter tuning.
Process: Further split the training set into inner training and validation sets, and use performance on the inner validation sets to select hyperparameters.
Outer Loop:
Purpose: Model evaluation.
Process: Refit the model with the selected hyperparameters and evaluate its performance on the outer folds.
Since we have different objective functions based on various metrics, we need to repeat the process for each optimization metric. Moreover, it is important to assess the correspondence of classifier performance across these metrics. Specifically, we can use the agreement of classifier rankings across accuracy indicators by applying Kendall’s rank correlation coefficient. This helps us determine whether the metrics have high agreement and provide consistent recommendations (the best case), or if they disagree. If there is disagreement, we can decide which metric to focus on, whether it be a local or global assessment.
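A minimal sketch of the ranking-agreement check with Kendall's rank correlation coefficient via SciPy; the classifier ranks below are invented for illustration.

```python
# Minimal Kendall's tau sketch: agreement between classifier rankings under two metrics.
from scipy.stats import kendalltau

# Ranks of five hypothetical classifiers under two accuracy indicators (illustrative).
rank_by_auc = [1, 2, 3, 4, 5]
rank_by_ks = [1, 3, 2, 4, 5]

tau, p_value = kendalltau(rank_by_auc, rank_by_ks)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")   # high tau => consistent recommendations
```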
Profitability Calculation#
The goal of this calculation is to estimate the profitability of a credit scoring model (or scorecard) by analyzing the costs associated with classification errors—specifically, false positives and false negatives. This involves determining how often good credit risks are wrongly classified as bad (False Positive Rate, FPR) and bad credit risks are wrongly classified as good (False Negative Rate, FNR), and then weighting these errors by their respective costs.
Key Concepts#
False Positive Rate (FPR): The fraction of good credit risks that are incorrectly classified as bad.
False Negative Rate (FNR): The fraction of bad credit risks that are incorrectly classified as good.
Misclassification Costs:
\(C(+ | -)\): The opportunity cost of denying credit to a good risk. This is the cost incurred when a good applicant is mistakenly rejected.
\(C(- | +)\): The cost of granting credit to a bad risk. This includes financial losses, often quantified as the net present value of exposure at default (EAD) times the loss given default (LGD).
Calculation#
The misclassification cost of a scorecard, \(C(s)\), is calculated using the formula:
$\( C(s) = C(+ | -) \cdot \text{FPR} + C(- | +) \cdot \text{FNR} \)$
Here’s a step-by-step breakdown:
Determine the Costs:
\(C(+ | -)\) represents the cost of wrongly denying credit to a good risk.
\(C(- | +)\) represents the cost of wrongly granting credit to a bad risk.
Calculate the FPR and FNR:
FPR is the proportion of good applicants that are incorrectly classified as bad.
FNR is the proportion of bad applicants that are incorrectly classified as good.
Combine the Costs and Rates:
Multiply the cost of each type of error by its rate to get the weighted costs.
Sum the Weighted Costs:
Add the weighted FPR and FNR to get the total misclassification cost for the scorecard.
Cost Ratios and Scenarios#
To cover different scenarios, the calculation considers various ratios of \(C(+ | -)\) to \(C(- | +)\), assuming that it is generally more costly to grant credit to a bad risk than to reject a good application. For example, the ratios range from 1:2 to 1:50. By fixing \(C(+ | -)\) at 1 and varying \(C(- | +)\), the analysis can explore how different misclassification costs impact the profitability estimation.
Normalization and Comparison#
Compute Misclassification Costs: For each cost setting and credit scoring dataset, the misclassification costs \(C(s)\) are calculated using the formula above.
Estimate Expected Error Costs: These costs are averaged over different datasets to get an overall estimate.
Normalize Costs: The costs are normalized to represent percentage improvements compared to a baseline model, such as a logistic regression (LR) model.
Example for Clarification#
Assume:
\(C(+ | -) = 1\) (Opportunity cost for rejecting a good applicant).
\(C(- | +) = 10\) (Cost for approving a bad applicant).
FPR = 0.05 (5% of good applicants are wrongly rejected).
FNR = 0.10 (10% of bad applicants are wrongly approved).
Calculation: $\( C(s) = 1 \cdot 0.05 + 10 \cdot 0.10 = 0.05 + 1 = 1.05 \)$
This result suggests that the total misclassification cost for this scorecard, given the specified costs and error rates, is 1.05.
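A minimal sketch that evaluates the same error rates under the range of cost ratios discussed above, using \(C(s) = C(+ | -) \cdot \text{FPR} + C(- | +) \cdot \text{FNR}\).

```python
# Minimal misclassification-cost sketch across several cost ratios (illustrative rates).
fpr, fnr = 0.05, 0.10
c_good = 1                                  # C(+|-): cost of rejecting a good applicant

for c_bad in (2, 5, 10, 20, 50):            # C(-|+): cost of accepting a bad applicant
    cost = c_good * fpr + c_bad * fnr
    print(f"ratio 1:{c_bad:<2d} -> C(s) = {cost:.2f}")
```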
Machine Learning Models#
Linear Models:
Linear Discriminant Analysis (LDA)
Logistic Regression (LR)
Naïve Bayes (NB)
Instance-Based Learning:
k-Nearest Neighbor (k-NN)
Decision Trees:
Decision Trees (DTs)
Random Forests (RFs)
Support Vector Machines (SVMs)
Neural Networks:
Artificial Neural Networks (ANNs)
Convolutional Neural Networks (CNNs)
Deep Multi-Layer Perceptron (DMLP)
Restricted Boltzmann Machines (RBMs)
Deep Belief Networks (DBNs)
Ensemble Methods:
Boosting
Extreme Gradient Boost (XGBoost)
Bagging
Feature Selection Methods#
1. Filter Methods#
F-score:
Measures how well a feature discriminates between two sets of data.
Formula: Compares average values of a feature across the whole dataset, positive instances, and negative instances.
Rough Set Theory:
Defines important features based on the indiscernibility relation.
Uses subsets of features to find a reduced set of important features.
2. Wrapper Methods#
Stepwise Selection:
Forward Selection: Adds features one by one based on significance.
Backward Elimination: Starts with all features and removes insignificant ones.
Stepwise Feature Selection: Combines forward and backward methods.
Genetic Algorithm:
Evolves a population of solutions using selection, crossover, and mutation.
Uses a fitness score to measure model performance (e.g., classification accuracy).
Balances exploration (searching new regions) and exploitation (using known information) to find the best feature subsets.
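A minimal forward-selection sketch using scikit-learn's `SequentialFeatureSelector` (a genetic-algorithm wrapper would require a separate library and is not shown); the data and the number of selected features are illustrative.

```python
# Minimal wrapper-method sketch: forward stepwise selection (illustrative only).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,
    direction="forward",                     # "backward" gives backward elimination
    scoring="roc_auc",
    cv=5,
)
selector.fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```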
3. Embedded Methods#
LASSO (Least Absolute Shrinkage and Selection Operator):
Uses L1-penalized regression to select features.
Objective: Minimize prediction error subject to an L1 penalty on the absolute size of the coefficients, which shrinks some coefficients exactly to zero.
Simplifies the model by reducing coefficients of less important features.
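A minimal L1-penalized (LASSO-style) logistic regression sketch: features whose coefficients are shrunk to zero are effectively dropped; the penalty strength `C` is illustrative.

```python
# Minimal LASSO-style feature-selection sketch with L1-penalized logistic regression.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)

lasso_lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X, y)
kept = np.flatnonzero(lasso_lr.coef_[0])     # indices of features with non-zero weight
print("retained features:", kept)
```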