Disaster Tweets Classification
- Kyuta Yasuda
- Jan 28
- 2 min read
Project Description
This project aims to classify tweets as related to disasters or not using machine learning. By leveraging a Random Forest model combined with text preprocessing techniques, the model predicts whether a tweet indicates a disaster-related event.
Key Highlights
Dataset: A labeled dataset containing tweets with information about their disaster relevance (target: 0 or 1).
Objective: Build a machine learning pipeline to preprocess tweet data and classify tweets as disaster-related or not.
Tools and Libraries:
Python
Libraries: pandas and scikit-learn (including its TfidfVectorizer) for text preprocessing, model training, and evaluation.
Jupyter Notebook for implementation.
Project Workflow
Data Preprocessing:
Utilized TfidfVectorizer to transform textual features (text, keyword, location) into numerical vectors.
Applied feature extraction with a maximum of 1,000 features for text and 100 each for keyword and location.
Handled missing values and normalized text data.
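The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the project's exact code: the two sample tweets are made-up placeholders, and the `max_features=1000` cap mirrors the setting described above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    "text": ["Forest fire near La Ronge", "I love my new phone"],
    "keyword": ["wildfire", None],
    "location": [None, "Tokyo"],
})

# Replace missing keyword/location values before vectorizing,
# since TfidfVectorizer cannot handle NaN input.
df[["keyword", "location"]] = df[["keyword", "location"]].fillna("")

# Cap the vocabulary at 1,000 terms for the tweet body.
text_vec = TfidfVectorizer(max_features=1000)
X_text = text_vec.fit_transform(df["text"])
print(X_text.shape)  # (number of tweets, vocabulary size)
```

TfidfVectorizer also lowercases and tokenizes the text by default, which covers basic normalization.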
Feature Engineering:
Created a preprocessing pipeline using ColumnTransformer for efficient feature handling.
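A ColumnTransformer of this shape might look like the sketch below, assuming the three text columns and the feature caps (1,000 / 100 / 100) described above; the sample rows are invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

# One TF-IDF vectorizer per column, each with its own feature cap.
preprocessor = ColumnTransformer(
    transformers=[
        ("text", TfidfVectorizer(max_features=1000), "text"),
        ("keyword", TfidfVectorizer(max_features=100), "keyword"),
        ("location", TfidfVectorizer(max_features=100), "location"),
    ]
)

df = pd.DataFrame({
    "text": ["Forest fire near La Ronge", "I love my new phone"],
    "keyword": ["wildfire", "none"],
    "location": ["Canada", "Tokyo"],
})

# Output columns are the concatenated TF-IDF features of all three fields.
X = preprocessor.fit_transform(df)
print(X.shape)
```

Passing a single column name (rather than a list) hands each vectorizer a 1-D Series, which is what TfidfVectorizer expects.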
Model Implementation:
Used a Random Forest Classifier with a fixed seed (random_state=42) for reproducibility.
Integrated the preprocessing pipeline with the classifier using scikit-learn's Pipeline.
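Wiring the classifier to the preprocessing step can be sketched like this; `preprocessor` here is a simplified single-column stand-in for the ColumnTransformer above, and the four labeled tweets are invented.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Simplified preprocessing: TF-IDF on the tweet body only.
preprocessor = ColumnTransformer(
    [("text", TfidfVectorizer(max_features=1000), "text")]
)

# Chain preprocessing and classification into one estimator.
model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("classifier", RandomForestClassifier(random_state=42)),
])

df = pd.DataFrame({
    "text": ["Forest fire near La Ronge", "I love my new phone",
             "Earthquake hits the coast", "Great pizza tonight"],
})
y = [1, 0, 1, 0]  # 1 = disaster-related, 0 = not

model.fit(df, y)
preds = model.predict(df)
print(preds)
```

Because the pipeline bundles vectorization and classification, the same object can be passed straight to cross-validation utilities without leaking test data into the TF-IDF vocabulary.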
Cross-Validation:
Performed K-fold cross-validation with k=5 to assess model stability and performance.
Results:
Mean Accuracy: 77.58%
Standard Deviation: 1.30%
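The 5-fold cross-validation can be reproduced with `cross_val_score` as below; a synthetic dataset stands in for the tweets here, so the printed numbers will differ from the ~77.6% reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy stand-in for the vectorized tweet features.
X, y = make_classification(n_samples=100, random_state=42)

clf = RandomForestClassifier(random_state=42)

# k=5 folds, scoring by accuracy, as in the project.
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.2%} (std: {scores.std():.2%})")
```

A small standard deviation across folds, like the 1.30% reported, suggests the model's performance is stable across data splits.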
Evaluation:
Evaluated the model on a test set, achieving the following metrics:
Accuracy: 78%
Precision (Class 0): 77%
Recall (Class 1): 66%
F1-Score (Class 1): 72%
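Metrics of this kind come directly from scikit-learn's `classification_report`; the labels below are made up purely to show the report format, not the project's actual predictions.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical true labels and predictions on a tiny test set.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1, 1, 1]

acc = accuracy_score(y_true, y_pred)
print(f"Accuracy: {acc:.0%}")

# Per-class precision, recall, and F1, as quoted above.
print(classification_report(y_true, y_pred, digits=2))
```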
Learning Outcomes
Developed a complete pipeline for text preprocessing and model training.
Enhanced understanding of text feature extraction techniques such as TF-IDF.
Learned to integrate cross-validation for robust model evaluation.
Practiced exporting and visualizing prediction results in a Kaggle-style competition.
Next Steps
Experiment with advanced models like Gradient Boosting or deep learning (e.g., LSTM, BERT) for improved accuracy.
Perform hyperparameter tuning using Grid Search or Bayesian Optimization.
Include additional features such as word embeddings for better semantic understanding.