
🚀 Learning Machine Learning from Scratch: Spam Classifier ✉️🤖
Learning machine learning doesn’t have to be complicated. An ideal starter project is to build a spam email classifier using the Enron dataset. Along the way, you’ll practice:
- ✅ Text processing (tokenization, stemming, lemmatization)
- 🔢 Turning words into numbers with TF‑IDF
- 📊 Training a simple model such as Naive Bayes or even an SVM
- 📈 Evaluating with accuracy, precision, recall and F1‑score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import glob

# load emails and labels (Enron folders with "spam" in the path hold spam)
emails, labels = [], []
for path in glob.glob("enron/**/*.txt", recursive=True):
    with open(path, errors="ignore") as f:
        emails.append(f.read())
    labels.append("spam" if "spam" in path else "ham")

# vectorize and train
X = TfidfVectorizer(stop_words="english").fit_transform(emails)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=42
)
model = MultinomialNB().fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
📝 Explanation in a Few Words
Imagine each email is a cooking recipe. First we break the recipe into words (tokenization) and convert them into numbers based on how frequent they are (TF‑IDF). Then an algorithm like Naive Bayes learns which words tend to appear in “spam” emails. After training, you just feed a new email to the model and it will tell you whether it’s junk or not. ✅
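As a hand-check of that intuition, here is a toy end-to-end run (a four-line made-up "dataset", not Enron) using the same TfidfVectorizer + MultinomialNB combination, wrapped in a Pipeline so you can feed it raw text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# tiny invented training set -- illustrative only
emails = [
    "win a free prize, claim your money now",
    "congratulations, you won a free lottery prize",
    "meeting moved to monday, see attached agenda",
    "lunch on friday? let me know if that works",
]
labels = ["spam", "spam", "ham", "ham"]

# the pipeline bundles vectorizer + model, so predict() accepts raw strings
clf = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
clf.fit(emails, labels)

print(clf.predict(["free money prize"])[0])        # words seen only in spam -> 'spam'
print(clf.predict(["agenda for the meeting"])[0])  # words seen only in ham  -> 'ham'
```

The pipeline design also prevents a classic beginner bug: fitting the vectorizer on the test set, which leaks vocabulary statistics into evaluation.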
This approach teaches you key NLP and ML concepts in a practical way, and in no time you’ll have your first classifier working! 💡✨
More information at the link 👇
Also published on LinkedIn.
