I’ve been scouring Google for a regular expression that can filter email signatures out of email text. Email body processing seems to be a niche corner of NLP: I found a lot of regexes for parsing email addresses, but nothing for parsing email signatures. So I made one and wanted to share it with anyone who might benefit from it.
This regex removes everything from the email’s closing phrase onward. When you end the email with “Cheers” or “Sincerely”, that phrase and everything following it will be matched.
(\w*\s)?([Bb]est|[Rr]egards|Have a|[Cc]heers|[Ss]incerely|[Tt]ake care|Looking forward|Fond|Kind|Yours)(\s*.*)
Paste this regex into a regex tester such as https://www.regextester.com/ and try it out on your own email text.
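If you would rather poke at it in Python, here is a minimal sketch of the same check. The sample email and the name SIGNATURE_RE are made up for illustration, and the character classes are written as [Bb] instead of [B|b], because inside square brackets the | is just a literal pipe character.

```python
import re

# The pattern from this post, with [Bb]-style character classes.
SIGNATURE_RE = re.compile(
    r"(\w*\s)?([Bb]est|[Rr]egards|Have a|[Cc]heers|[Ss]incerely|[Tt]ake care"
    r"|Looking forward|Fond|Kind|Yours)(\s*.*)"
)

# Made-up sample email, already flattened to a single line.
sample = "Hi Sam, the report is attached. Kind regards, Alex"

match = SIGNATURE_RE.search(sample)
if match:
    lead, closing, rest = match.groups()
    print(repr(lead))     # optional word plus whitespace before the closing
    print(repr(closing))  # the closing keyword that triggered the match
    print(repr(rest))     # everything after the keyword
```

Printing the three groups separately makes it easier to see which keyword fired, which helps when a word like “Kind” or “best” shows up in the body of an email and cuts it off too early.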
Then Python’s regular expression module, re, can be used to substitute the matched text with an empty string. Something like this, where msg holds the email text (don’t forget to import re):
import re
nosig = re.sub(r'(\w*\s)?([Bb]est|[Rr]egards|Have a|[Cc]heers|[Ss]incerely|[Tt]ake care|Looking forward|Fond|Kind|Yours)(\s*.*)', '', msg)
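Be sure to remove newline characters ‘\n’ (by using the replace method or something else) before applying this regular expression. By default, the . in a Python regex does not match a newline, so the match would stop at the end of the line containing the closing phrase and leave the rest of the signature in place.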
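An alternative to stripping newlines first is to pass the re.DOTALL flag, which lets . match newline characters too. The sketch below compares the two behaviours on a made-up multi-line email; strip_signature and PATTERN are hypothetical names, not part of any library.

```python
import re

PATTERN = (
    r"(\w*\s)?([Bb]est|[Rr]egards|Have a|[Cc]heers|[Ss]incerely|[Tt]ake care"
    r"|Looking forward|Fond|Kind|Yours)(\s*.*)"
)


def strip_signature(msg: str) -> str:
    """Hypothetical helper: drop the closing phrase and everything after it."""
    # re.DOTALL makes '.' match '\n', so the match runs to the end of the text.
    return re.sub(PATTERN, "", msg, flags=re.DOTALL).rstrip()


# Made-up sample email that still contains its newlines.
msg = "Hi Sam,\nThe report is attached.\n\nCheers,\nAlex\n555-0100"

# Default flags: the match stops at the end of the "Cheers," line,
# so the name and phone number are left behind.
without_flag = re.sub(PATTERN, "", msg)

# With re.DOTALL: the whole signature block is removed.
with_flag = strip_signature(msg)

print(repr(without_flag))
print(repr(with_flag))
```

This keeps the line breaks in the body of the email, which can matter if later processing steps rely on them.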
Feel free to contact me with any feedback on how to improve this regex. If it was useful to you, I would love to hear about it!
References
- Common email endings: https://www.thebalancecareers.com/email-message-closing-examples-2061895
- The starter code that I built my regex on top of: https://stackoverflow.com/questions/14654832/regex-to-match-warm-regards-type-email-signatures
- Removing newline character: https://stackoverflow.com/questions/64149406/python-regex-works-in-regex-tester-but-not-in-practical