Disaster Tweets Classification

  • Writer: Kyuta Yasuda
  • Jan 28
  • 2 min read

Project Description

This project classifies tweets as disaster-related or not using machine learning. A Random Forest model, combined with TF-IDF text preprocessing, predicts whether a tweet describes a real disaster event.


Key Highlights

  1. Dataset: A labeled dataset of tweets annotated for disaster relevance (target: 1 for disaster-related, 0 otherwise).

  2. Objective: Build a machine learning pipeline to preprocess tweet data and classify tweets as disaster-related or not.

  3. Tools and Libraries:

    • Python

    • Libraries: pandas for data handling; scikit-learn (including its TfidfVectorizer) for text preprocessing, model training, and evaluation.

    • Jupyter Notebook for implementation.


Project Workflow

  1. Data Preprocessing:

    • Utilized TfidfVectorizer to transform textual features (text, keyword, location) into numerical vectors.

    • Applied feature extraction with a maximum of 1000 features for text and 100 features for keyword and location.

    • Handled missing values and normalized text data.

  2. Feature Engineering:

    • Created a preprocessing pipeline using ColumnTransformer for efficient feature handling.

  3. Model Implementation:

    • Used a Random Forest Classifier with a fixed random seed (random_state = 42) for reproducibility.

    • Integrated the preprocessing pipeline with the classifier using scikit-learn's Pipeline (a sketch of the full pipeline appears after this list).

  4. Cross-Validation:

    • Performed K-fold cross-validation with k=5 to assess model stability and performance.

    • Results:

      • Mean Accuracy: 77.58%

      • Standard Deviation: 1.30%

  5. Evaluation:

    • Evaluated the model on a held-out test set (sketched, together with cross-validation, after this list), achieving the following metrics:

      • Accuracy: 78%

      • Precision (Class 0): 77%

      • Recall (Class 1): 66%

      • F1-Score (Class 1): 72%
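
To make the workflow concrete, here is a minimal sketch of the pipeline from steps 1-3. The column names (text, keyword, location, target) follow the Kaggle disaster-tweets dataset; the file path and the fillna strategy are assumptions, and text normalization is left to TfidfVectorizer's default lowercasing.

```python
# Minimal pipeline sketch: per-column TF-IDF feeding a Random Forest.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

df = pd.read_csv("train.csv")  # assumed file name

# TfidfVectorizer expects strings, so replace missing values with "".
for col in ["text", "keyword", "location"]:
    df[col] = df[col].fillna("")

X = df[["text", "keyword", "location"]]
y = df["target"]

# One vectorizer per column, with the feature caps described above.
preprocessor = ColumnTransformer(
    transformers=[
        ("text", TfidfVectorizer(max_features=1000), "text"),
        ("keyword", TfidfVectorizer(max_features=100), "keyword"),
        ("location", TfidfVectorizer(max_features=100), "location"),
    ]
)

# Chain preprocessing and classification into a single estimator.
model = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        ("classifier", RandomForestClassifier(random_state=42)),
    ]
)
```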
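
Continuing the sketch, steps 4 and 5 add only a few lines. The 80/20 split and stratification below are assumptions; the post does not say how the test set was constructed.

```python
# Continues from the pipeline sketch above (uses model, X, y).
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split

# 5-fold cross-validation for stability (the post reports a mean
# accuracy around 77.6% with a standard deviation around 1.3%).
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")

# Held-out evaluation; split parameters are assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```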


Learning Outcomes

  1. Developed a complete pipeline for text preprocessing and model training.

  2. Enhanced understanding of text feature extraction techniques such as TF-IDF (the weighting formula is sketched after this list).

  3. Learned to integrate cross-validation for robust model evaluation.

  4. Practiced exporting and visualizing prediction results in a Kaggle-style competition.
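
For reference, the classic TF-IDF weight for a term t in a document d over a corpus of N documents is shown below; scikit-learn's TfidfVectorizer uses a smoothed variant of the IDF term by default.

```latex
% tf(t, d): count of term t in document d
% df(t): number of documents containing t; N: number of documents
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}
% scikit-learn's default (smooth_idf=True) replaces the IDF factor
% with \log\frac{1 + N}{1 + \mathrm{df}(t)} + 1.
```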


Next Steps

  • Experiment with advanced models like Gradient Boosting or deep learning (e.g., LSTM, BERT) for improved accuracy.

  • Perform hyperparameter tuning using Grid Search or Bayesian Optimization (a Grid Search sketch follows this list).

  • Include additional features such as word embeddings for better semantic understanding.
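
As a starting point for the tuning idea above, here is a minimal Grid Search sketch over the Random Forest step of the earlier pipeline. The parameter grid values are illustrative assumptions, not settings from the project.

```python
# Continues from the earlier sketches (uses model, X_train, y_train).
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the "classifier__" prefix targets the Random
# Forest step inside the Pipeline defined above.
param_grid = {
    "classifier__n_estimators": [100, 300, 500],
    "classifier__max_depth": [None, 20, 50],
    "classifier__min_samples_split": [2, 5, 10],
}

search = GridSearchCV(model, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
```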

 
 
 
