Data Science & Machine Learning Basics
This page is my personal repository of the most common and useful machine learning algorithms in Python, along with other data science tips and tricks.
\(\text{Data Science}\)
Data science involves extracting knowledge from structured and unstructured data. It combines principles from statistics, machine learning, data analysis, and domain knowledge to understand and interpret data.
Data Collection & Acquisition
- Web scraping: collecting data by scraping web pages
- API integration
- Data Lakes, Data Warehouse
Data Cleaning & Preprocessing
This involves handling missing values, data transformation, feature engineering, encoding categorical variables, and handling outliers.
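As a minimal sketch of these steps (assuming pandas and NumPy are available; the toy dataset is invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy dataset with a missing value, a categorical column, and an outlier
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 39],
    "city": ["NY", "SF", "NY", "LA", "SF"],
    "income": [50_000, 62_000, 58_000, 1_000_000, 61_000],
})

# Handling missing values: impute with the median
df["age"] = df["age"].fillna(df["age"].median())

# Handling outliers: clip to the 5th-95th percentile range
low, high = df["income"].quantile([0.05, 0.95])
df["income"] = df["income"].clip(low, high)

# Encoding categorical variables: one-hot encoding
df = pd.get_dummies(df, columns=["city"])

print(df.shape)  # (5, 5) after one-hot encoding the 3 city categories
```

In practice the imputation strategy (median, mean, model-based) and the outlier rule depend on the data; the clip-to-quantiles step here is just one common choice.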
Exploratory Data Analysis (EDA)
This usually includes descriptive statistics, data visualization, and identifying patterns, trends, and correlations between features and labels.
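A quick EDA sketch with pandas (the data here is synthetic, generated so that `y` depends linearly on `x`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=100)
df = pd.DataFrame({"x": x, "y": 2 * x + rng.normal(scale=0.5, size=100)})

print(df.describe())  # descriptive statistics: count, mean, std, quartiles
print(df.corr())      # correlation matrix reveals the x-y relationship
```

For visualization, `df.plot.scatter(x="x", y="y")` or a seaborn pairplot would typically come next.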
Statistical Methods
- ANOVA and Categorical Features: How do we treat categorical features in a data science project?
- Hypothesis Testing
- Probability Distributions
- Inferential Statistics
- Sampling Methods
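As one concrete example of these methods, a one-way ANOVA with SciPy (assumed available; the three groups are simulated, with the third group's mean deliberately shifted):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(5.0, 1.0, 40)
group_b = rng.normal(5.0, 1.0, 40)
group_c = rng.normal(6.0, 1.0, 40)  # shifted mean

# One-way ANOVA: H0 says all group means are equal
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
```

A small p-value leads us to reject the null hypothesis that the group means are equal; `stats.ttest_ind` works the same way for the two-group case.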
Big Data Techniques
- Hadoop, Spark
- Distributed Data Storage (e.g., HDFS, NoSQL)
- Data Pipelines, ETL (Extract, Transform, Load)
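Spark or Hadoop would handle this at scale; as a miniature stand-in, the same Extract → Transform → Load shape can be sketched with pandas and SQLite (both the data and the table name are invented for illustration):

```python
import sqlite3

import pandas as pd

# Extract: a stand-in for reading from a CSV, an API, or a data lake
raw = pd.DataFrame({"name": [" Alice ", "bob"], "score": ["10", "20"]})

# Transform: normalize text and fix data types
clean = raw.assign(
    name=raw["name"].str.strip().str.title(),
    score=raw["score"].astype(int),
)

# Load: write the cleaned table into a database
with sqlite3.connect(":memory:") as conn:
    clean.to_sql("scores", conn, index=False)
    total = conn.execute("SELECT SUM(score) FROM scores").fetchone()[0]

print(total)  # 30
```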
\(\text{Machine Learning Algorithms}\)
\(\text{Supervised Learning}\)
(Training with labeled data: input-output pairs)
Regression
Classification
Parametric
- Logistic Regression
- Naive Bayes
- Linear Discriminant Analysis (LDA)
- Quadratic Discriminant Analysis (QDA)
Non-Parametric
Multi-Class Classification
Bayesian or Probabilistic Classification
- What is Bayesian or Probabilistic Classification?
- Linear Discriminant Analysis (LDA)
- Quadratic Discriminant Analysis (QDA)
- Naive Bayes
- Bayesian Network Classifier (Tree Augmented Naive Bayes (TAN))
Non-probabilistic Classification
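To make the taxonomy above concrete, a minimal sketch (scikit-learn assumed available) comparing a parametric classifier (logistic regression) with a probabilistic one (Gaussian Naive Bayes) on the Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

accs = {}
for model in (LogisticRegression(max_iter=1000), GaussianNB()):
    # fit on the training split, score accuracy on the held-out split
    accs[type(model).__name__] = model.fit(X_tr, y_tr).score(X_te, y_te)

print(accs)
```

Both are multi-class out of the box here; Naive Bayes additionally exposes calibratable class probabilities via `predict_proba`.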
\(\text{Unsupervised Learning}\)
(Training with unlabeled data)
Clustering
- k-Means Clustering
- Hierarchical Clustering
- DBSCAN (Density-Based Spatial Clustering)
- Gaussian Mixture Models (GMM)
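A k-means sketch with scikit-learn (the blobs are synthetic, generated with three known centers so the clustering is easy to verify):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# 300 unlabeled 2-D points drawn around 3 centers
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.shape)  # (3, 2): one 2-D centroid per cluster
```

Choosing `n_clusters` is the hard part in practice; the elbow method or silhouette score are common heuristics.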
Dimensionality Reduction
- Principal Component Analysis
- Latent Dirichlet Allocation (LDA)
- t-SNE (t-distributed Stochastic Neighbor Embedding)
- Factor Analysis
- Autoencoders
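A PCA sketch (scikit-learn assumed; the data is synthetic, with one feature made linearly dependent so there is redundancy to compress away):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 4] = X[:, 0] + X[:, 1]  # redundant feature: a linear combination

# Project the 5 features onto the 2 directions of highest variance
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)
```

The explained-variance ratios tell you how much information the low-dimensional projection retains, which is the usual criterion for picking `n_components`.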
Anomaly Detection
- Isolation Forests
- One-Class SVM
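An isolation-forest sketch (scikit-learn assumed; one obvious anomaly is injected into otherwise Gaussian data):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X = np.vstack([X, [[8.0, 8.0]]])  # inject one far-out point

iso = IsolationForest(random_state=0).fit(X)
labels = iso.predict(X)  # -1 = anomaly, 1 = normal
print(labels[-1])  # the injected point is isolated quickly and flagged -1
```

The intuition: anomalies are easier to isolate with random splits, so they end up at shallow depths in the ensemble's random trees.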
\(\text{Semi-Supervised Learning}\)
(Combination of labeled and unlabeled data)
- Self-training
- Co-training
- Label Propagation
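A label-propagation sketch with scikit-learn's `LabelSpreading` (assumed available), hiding 80% of the Iris labels and letting the graph-based method fill them in:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelSpreading

X, y = load_iris(return_X_y=True)

# Mask ~80% of the labels; -1 marks an unlabeled sample
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) < 0.8] = -1

model = LabelSpreading().fit(X, y_partial)
acc = (model.transduction_ == y).mean()  # labels inferred for ALL samples
print(f"{acc:.3f}")
```

The transductive accuracy is computed against the true labels only for verification; in a real semi-supervised setting most of `y` would be genuinely unknown.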
\(\text{Reinforcement Learning}\)
(Learning via rewards and penalties)
- Markov Decision Process (MDP)
- Q-Learning
- Deep Q-Networks (DQN)
- Policy Gradient Method
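A tabular Q-learning sketch on a toy 5-state chain MDP (the environment is invented for illustration: action 1 moves right, action 0 moves left, and reaching the last state yields reward 1 and ends the episode):

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.5, 0.9  # learning rate and discount factor
rng = np.random.default_rng(0)

def step(s, a):
    s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s2, float(s2 == n_states - 1)  # reward 1 only at the goal

for _ in range(300):
    s = 0
    while s != n_states - 1:
        a = int(rng.integers(n_actions))  # random behavior policy (Q-learning is off-policy)
        s2, r = step(s, a)
        # Bellman update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

print(Q.argmax(axis=1)[:-1])  # greedy policy in non-terminal states: always move right
```

Because Q-learning is off-policy, even a purely random behavior policy converges to the optimal greedy policy here; DQN replaces the table with a neural network.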
\(\text{Deep Learning}\)
- PyTorch
- Artificial Neural Networks (ANN)
- Convolutional Neural Networks (CNN)
- Recurrent Neural Networks (RNN)
- Long Short-Term Memory (LSTM)
- Generative Adversarial Networks (GAN)
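PyTorch is the usual tool for all of these; as a library-free illustration of what an ANN does under the hood, here is a one-hidden-layer network trained on XOR with plain NumPy (architecture, learning rate, and iteration count are arbitrary choices for this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR: not linearly separable

# One hidden layer of 8 sigmoid units
W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for _ in range(10_000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: gradient of squared error through both layers
    d_out = (out - y) * out * (1 - out)
    d_h = d_out @ W2.T * h * (1 - h)
    # Full-batch gradient descent step
    W2 -= 0.5 * h.T @ d_out; b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h;  b1 -= 0.5 * d_h.sum(axis=0)

mse = np.mean((out - y) ** 2)
print(np.round(out.ravel(), 2), f"mse={mse:.4f}")
```

In PyTorch the backward pass would be replaced by `loss.backward()` and an optimizer step; the manual version above is only to show what autograd computes.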
\(\text{Model Evaluation and Fine Tuning}\)
Model Evaluation Metrics
- For Regression: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), \(R^2\) score
- For Classification: Accuracy, Precision, Recall, F1 Score, ROC-AUC
- Cross-validation: k-Fold, Stratified k-Fold, Leave-One-Out
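A sketch combining these (scikit-learn assumed available): stratified k-fold cross-validation of a classifier, scored with F1:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Stratified k-fold keeps the class ratio the same in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
print(f"mean F1 across folds: {scores.mean():.3f}")
```

Swapping `scoring` for `"neg_mean_squared_error"` or `"r2"` gives the regression-metric analogue of the same workflow.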
Model Optimization
- Bias-Variance Trade-off
- Hyperparameter Tuning: Grid Search, Random Search, Bayesian Optimization
- Feature Selection Techniques: Recursive Feature Elimination (RFE), L1 (Lasso) Regularization, L2 (Ridge) Regularization
- Model Interpretability: SHAP (Shapley values), LIME (Local Interpretable Model-agnostic Explanations)
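A hyperparameter-tuning sketch using grid search (scikit-learn assumed; the parameter grid is a small illustrative choice, not a recommendation):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Exhaustively try every combination, scored by 5-fold cross-validation
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1]}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)

print(search.best_params_, f"{search.best_score_:.3f}")
```

Random search (`RandomizedSearchCV`) scales better to large grids, and Bayesian optimization libraries such as Optuna go further by modeling the score surface.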
Ensemble Methods
- Bagging (Bootstrap Aggregating): Random Forest
- Boosting: Gradient Boosting, AdaBoost, XGBoost, CatBoost
- Stacking: Stacked Generalization
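A stacking sketch tying the three ideas together (scikit-learn assumed): a bagging model and a boosting model as base learners, blended by a logistic-regression meta-learner:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),   # bagging
                ("gb", GradientBoostingClassifier(random_state=0))],  # boosting
    final_estimator=LogisticRegression(max_iter=1000),  # meta-learner
)
acc = stack.fit(X_tr, y_tr).score(X_te, y_te)
print(f"{acc:.3f}")
```

Internally `StackingClassifier` trains the meta-learner on out-of-fold predictions of the base models, which is what keeps stacked generalization from overfitting to the base learners' training error.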