Learning Machine Learning from Scratch: Spam Classifier

Learning machine learning doesn’t have to be complicated. An ideal starter project is to build a spam email classifier using the Enron dataset. Along the way you’ll practice:

  • ✅ Text processing (tokenization, stemming, lemmatization)
  • 🔢 Turning words into numbers with TF‑IDF
  • 📊 Training a simple model such as Naive Bayes or even an SVM
  • 📈 Evaluating with accuracy, precision, recall and F1‑score

import glob

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Load emails and labels (spam/ham); assumes spam files have "spam" in their path
emails, labels = [], []
for path in glob.glob("enron/**/*.txt", recursive=True):
    with open(path, encoding="utf-8", errors="ignore") as f:
        emails.append(f.read())
    labels.append("spam" if "spam" in path else "ham")

# Split first so the TF-IDF vocabulary is learned from the training emails only
train_texts, test_texts, y_train, y_test = train_test_split(
    emails, labels, test_size=0.2, stratify=labels, random_state=42
)

# Vectorize and train
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)
model = MultinomialNB().fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
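
To make the numbers printed by classification_report less mysterious, here is a minimal from-scratch sketch of accuracy, precision, recall and F1 for the "spam" class. The function name precision_recall_f1 and the toy labels are made up for illustration; they are not part of scikit-learn or the Enron dataset.

```python
def precision_recall_f1(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, recall and F1 for one class (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    # True positives: spam correctly flagged as spam
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: ham wrongly flagged as spam
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: spam that slipped through as ham
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, invented for this example
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]

report = precision_recall_f1(y_true, y_pred)
print(report)
```

Precision answers "of the emails flagged as spam, how many really were spam?", while recall answers "of the real spam, how much did we catch?". For a spam filter, low precision means good emails end up in the junk folder, so it is worth watching both numbers rather than accuracy alone.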

📝 Explanation in a Few Words

Imagine each email is a cooking recipe. First we break the recipe into words (tokenization) and convert them into numbers that reflect how often each word appears in that email and how rare it is across all emails (TF‑IDF). Then an algorithm like Naive Bayes learns which words tend to appear in spam emails. After training, you just feed a new email to the model and it will tell you whether it's junk or not.
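
To see what "turning words into numbers" means in practice, here is a small pure-Python sketch of tokenization and the textbook TF‑IDF weight, tf × log(N / df). The helper names and the three toy emails are invented for illustration. scikit-learn's TfidfVectorizer uses a smoothed variant of this formula and normalizes each vector by default, so its numbers will differ from these.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tf_idf(docs):
    """Return one {word: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {w: (c / total) * math.log(n_docs / df[w]) for w, c in counts.items()}
        )
    return weights


# Toy corpus, invented for this example
docs = [
    "win free money now",
    "free meeting notes attached",
    "lunch meeting at noon",
]

weights = tf_idf(docs)
for word, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{word}: {weight:.3f}")
```

In the first email, "win" appears in only one document while "free" appears in two, so "win" gets the higher weight. That is the idea the classifier builds on: words that are distinctive for a message count for more than words that show up everywhere.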

This approach teaches you key NLP and ML concepts in a practical way, and in no time you'll have your first classifier working! 💡✨

More information at the link 👇

Also published on LinkedIn.
Author: Juan Pedro Bretti Mandarano