Disaster Tweets Classification
- Kyuta Yasuda
- Jan 28
- 2 min read
Project Description
This project aims to classify tweets as related to disasters or not using machine learning. By leveraging a Random Forest model combined with text preprocessing techniques, the model predicts whether a tweet indicates a disaster-related event.
Key Highlights
Dataset: A labeled dataset containing tweets with information about their disaster relevance (target: 0 or 1).
Objective: Build a machine learning pipeline to preprocess tweet data and classify tweets as disaster-related or not.
Tools and Libraries:
Python
Libraries: pandas and scikit-learn (including its TfidfVectorizer) for text preprocessing, model training, and evaluation.
Jupyter Notebook for implementation.
Project Workflow
Data Preprocessing:
Utilized TfidfVectorizer to transform textual features (text, keyword, location) into numerical vectors.
Applied feature extraction with a maximum of 1,000 features for text and 100 each for keyword and location.
Handled missing values and normalized text data.
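The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the project's exact code: the two sample tweets are made-up placeholders, and the `max_features=1000` cap mirrors the setting described above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    "text": ["Forest fire near La Ronge", "I love my new phone"],
    "keyword": ["wildfire", None],
    "location": [None, "Tokyo"],
})

# Replace missing keyword/location values before vectorizing,
# since TfidfVectorizer cannot handle NaN input.
df[["keyword", "location"]] = df[["keyword", "location"]].fillna("")

# Cap the vocabulary at 1,000 terms for the tweet body.
text_vec = TfidfVectorizer(max_features=1000)
X_text = text_vec.fit_transform(df["text"])
print(X_text.shape)  # (number of tweets, vocabulary size)
```

TfidfVectorizer also lowercases and tokenizes the text by default, which covers basic normalization.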
Feature Engineering:
Created a preprocessing pipeline using ColumnTransformer for efficient feature handling.
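A ColumnTransformer of this shape might look like the sketch below, assuming the three text columns and the feature caps (1,000 / 100 / 100) described above; the sample rows are invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# One TF-IDF vectorizer per column, each with its own feature cap.
preprocessor = ColumnTransformer(
    transformers=[
        ("text", TfidfVectorizer(max_features=1000), "text"),
        ("keyword", TfidfVectorizer(max_features=100), "keyword"),
        ("location", TfidfVectorizer(max_features=100), "location"),
    ]
)

df = pd.DataFrame({
    "text": ["Forest fire near La Ronge", "I love my new phone"],
    "keyword": ["wildfire", "none"],
    "location": ["Canada", "Tokyo"],
})

# Output columns are the concatenated TF-IDF features of all three fields.
X = preprocessor.fit_transform(df)
print(X.shape)
```

Passing a single column name (rather than a list) hands each vectorizer a 1-D Series, which is what TfidfVectorizer expects.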
Model Implementation:
Used a Random Forest Classifier with a fixed seed (random_state=42) for reproducibility.
Integrated the preprocessing pipeline with the classifier using scikit-learn's Pipeline.
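Wiring the classifier to the preprocessing step can be sketched like this; `preprocessor` here is a simplified single-column stand-in for the ColumnTransformer above, and the four labeled tweets are invented.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Simplified preprocessing: TF-IDF on the tweet body only.
preprocessor = ColumnTransformer(
    [("text", TfidfVectorizer(max_features=1000), "text")]
)

# Chain preprocessing and classification into one estimator.
model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("classifier", RandomForestClassifier(random_state=42)),
])

df = pd.DataFrame({
    "text": ["Forest fire near La Ronge", "I love my new phone",
             "Earthquake hits the coast", "Great pizza tonight"],
})
y = [1, 0, 1, 0]  # 1 = disaster-related, 0 = not

model.fit(df, y)
preds = model.predict(df)
print(preds)
```

Because the pipeline bundles vectorization and classification, the same object can be passed straight to cross-validation utilities without leaking test data into the TF-IDF vocabulary.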
Cross-Validation:
Performed K-fold cross-validation with k=5 to assess model stability and performance.
Results:
Mean Accuracy: 77.58%
Standard Deviation: 1.30%
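The 5-fold cross-validation can be reproduced with `cross_val_score` as below; a synthetic dataset stands in for the tweets here, so the printed numbers will differ from the ~77.6% reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy stand-in for the vectorized tweet features.
X, y = make_classification(n_samples=100, random_state=42)

clf = RandomForestClassifier(random_state=42)

# k=5 folds, scoring by accuracy, as in the project.
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.2%} (std: {scores.std():.2%})")
```

A small standard deviation across folds, like the 1.30% reported, suggests the model's performance is stable across data splits.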
Evaluation:
Evaluated the model on a test set, achieving the following metrics:
Accuracy: 78%
Precision (Class 0): 77%
Recall (Class 1): 66%
F1-Score (Class 1): 72%
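Metrics of this kind come directly from scikit-learn's `classification_report`; the labels below are made up purely to show the report format, not the project's actual predictions.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical true labels and predictions on a tiny test set.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1, 1, 1]

acc = accuracy_score(y_true, y_pred)
print(f"Accuracy: {acc:.0%}")

# Per-class precision, recall, and F1, as quoted above.
print(classification_report(y_true, y_pred, digits=2))
```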
Learning Outcomes
Developed a complete pipeline for text preprocessing and model training.
Enhanced understanding of text feature extraction techniques such as TF-IDF.
Learned to integrate cross-validation for robust model evaluation.
Practiced exporting and visualizing prediction results in a Kaggle-style competition.
Next Steps
Experiment with advanced models like Gradient Boosting or deep learning (e.g., LSTM, BERT) for improved accuracy.
Perform hyperparameter tuning using Grid Search or Bayesian Optimization.
Include additional features such as word embeddings for better semantic understanding.