Hayling Island Internet Marketing and Technology Blog
Building a PHP Swear Filter
When I first started to write a swear filter, I thought the problem was as simple as matching the words on a banned list against your string, but the more I thought about it, the more complex it became.
If I were talking about meat, for example, would you know what the words 'p**k', 'be@f' and 'B4C0N' are? I think it is quite clear what has been written, but obviously these would not match 'pork', 'beef' and 'bacon'.
I started to break down the different ways of disguising a word. I thought there were two main types that I would have to cater for:
'P-O-R-K'
And
'P**K'
For the first technique, I felt a search and replace would be a good start.
// $phrase holds the text to check, for example 'P-O-R-K'
$swearWords = array ('pork', 'beef', 'chicken', 'duck', 'fish');
$stripChars = array ('`', '!', '\'', '"', '£', '$', '%', '^', '&', '*', '-', ' ');
$numberReplace = array('0' => 'o', '1' => 'i', '2' => 'z', '3' => 'e', '4' => 'a', '5' => 's', '6' => 'b', '7' => 'l', '8' => 'ate', '9' => 'g');
$phrase = strtolower($phrase);
$phraseStrip = str_replace($stripChars, '', $phrase);
foreach ($numberReplace as $number => $letter)
{
    // Array keys like '0' become integers in PHP, so cast back to a string
    $phraseStrip = str_replace ((string) $number, $letter, $phraseStrip);
}
I started to build my code. $swearWords is my array of words that I do not want to show on my site; obviously this could come from a database or a flat file. $stripChars is an array that holds the characters I want to strip from my word or phrase, although here I have not used every character available. I have also included another array, $numberReplace, so I can handle words like 'B4C0N'.
Next I wanted to convert everything to lowercase to give a level playing field, and strip out the characters I did not require. I used a simple foreach loop to go through my numbers and replace them with the equivalent letters.
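To make the normalisation step easier to try out on its own, it could be wrapped in a small helper. This is only a sketch: the name normalisePhrase and the shortened character lists are assumptions for illustration, not part of the filter itself.

```php
<?php
// Hypothetical helper (name and short lists are assumptions):
// lowercases a phrase, strips separator characters and swaps digits for letters.
function normalisePhrase(string $phrase): string
{
    $stripChars    = array('-', '!', '*', '$', '%', ' ');
    $numberReplace = array('0' => 'o', '1' => 'i', '3' => 'e', '4' => 'a', '5' => 's');

    $phrase = strtolower($phrase);
    $phrase = str_replace($stripChars, '', $phrase);

    foreach ($numberReplace as $number => $letter) {
        // Numeric array keys are stored as integers, so cast before searching.
        $phrase = str_replace((string) $number, $letter, $phrase);
    }

    return $phrase;
}

echo normalisePhrase('P-O-R-K'), "\n"; // pork
echo normalisePhrase('B4C0N'), "\n";   // bacon
```

Both disguised spellings end up as plain lowercase words, which is what the banned list is compared against.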
$isSwear = false;
foreach ($swearWords as $swearWord)
{
    $pos = strpos($phraseStrip, $swearWord);
    // strpos returns 0 for a match at the very start, so compare with !== false
    if ($pos !== false) {
        echo "The word '$swearWord' was found in the string\n";
        $isSwear = true;
    }
}
I added a variable to store whether the phrase or word contains a swear word. Now I had one string of characters, and all I had to do was cycle through my stored banned words and check whether each one was in the string.
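One thing worth seeing in action is what happens once the spaces have been stripped out: a banned word can appear across the boundary of two innocent words. The sketch below uses a hypothetical helper, findBannedWords, and a made-up phrase to show both a genuine hit and an accidental one.

```php
<?php
// Hypothetical helper: returns every banned word found inside a string
// that has already been lowercased and stripped.
function findBannedWords(string $normalised, array $swearWords): array
{
    $found = array();
    foreach ($swearWords as $swearWord) {
        // strpos() returns 0 for a match at the start, hence !== false.
        if (strpos($normalised, $swearWord) !== false) {
            $found[] = $swearWord;
        }
    }
    return $found;
}

$swearWords = array('pork', 'beef');

// "i like pork" with the spaces removed: a genuine match.
print_r(findBannedWords('ilikepork', $swearWords));

// "top or kettle" with the spaces removed becomes "toporkettle",
// which contains 'pork' even though no single word is banned.
print_r(findBannedWords('toporkettle', $swearWords));
```

So the compacted string is good at catching 'P-O-R-K', but it can also flag phrases that only look offensive once the spaces are gone.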
The next part was a little more tricky: I needed to find matches for 'P**K' and 'P$%K'. I thought the best way to accomplish this was to use a simple regular expression. It seemed best to break the string into individual words rather than compact the phrase into one long string, so as to avoid any unwanted matches.
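$phraseArray = explode(' ', $phrase);
foreach ($phraseArray as $word)
{
    // Repeated spaces leave empty strings, and an empty pattern would match everything
    if ($word === '') {
        continue;
    }
    // Build a regexp that matches masked words such as p**k
    $wordStrip = str_replace($stripChars, '*', $word);
    // Anchor with ^ and $ so a short word such as 'or' cannot match inside 'pork'
    $wordStrip = '(^'.str_replace('*', '[a-z]{1}', $wordStrip).'$)';
    foreach ($swearWords as $swearWord)
    {
        if (preg_match($wordStrip, $swearWord))
        {
            //echo 'Found "'.$swearWord.'"';
            $isSwear = true;
        }
    }
}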
Here I have exploded the phrase on spaces and begun to loop through all of the words. $wordStrip first has every character from the $stripChars array replaced with * as a temporary measure, and then every * is replaced with [a-z]{1}, which matches exactly one character between a and z. There are two steps because my final array of characters to remove will include '[', and that would cause havoc when I am also adding this character to the string.
Now that my regular expression is set up for the word, all I have to do again is loop through my array of banned words and check for any matches using preg_match.
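The same idea can be written as a stand-alone helper that builds the pattern one character at a time. This is a sketch under a few assumptions: the name maskToPattern and the short $maskChars list are made up for the example, it uses / as the delimiter, and it escapes everything else with preg_quote so stray punctuation cannot break the expression.

```php
<?php
// Hypothetical helper: turns a masked word such as 'p**k' into an
// anchored pattern where each masking character stands for one letter.
function maskToPattern(string $word, array $maskChars): string
{
    $pattern = '';
    foreach (str_split(strtolower($word)) as $char) {
        $pattern .= in_array($char, $maskChars, true)
            ? '[a-z]'                  // masking character: any single letter
            : preg_quote($char, '/');  // anything else: match it literally
    }
    return '/^' . $pattern . '$/';
}

$maskChars = array('*', '$', '%', '@', '!'); // assumed short list

echo maskToPattern('p**k', $maskChars), "\n";                       // /^p[a-z][a-z]k$/
var_dump(preg_match(maskToPattern('P$%K', $maskChars), 'pork'));    // int(1)
var_dump(preg_match(maskToPattern('or', $maskChars), 'pork'));      // int(0)
```

Because the pattern is anchored at both ends, an ordinary short word like 'or' does not count as a match just because it appears inside a banned word.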
Below is the whole function and a simple page.
function swearFilter($phrase)
{
    $isSwear = false;
    $swearWords = array ('turkey', 'duck', 'chicken', 'beef', 'pork');
    $stripChars = array ('`', '!', '\'', '"', '£', '$', '%', '^', '&', '*', '(', ')', '-', '_',
        '+', '=', '|', '\\', ',', '<', '.', '>', '?', '/', ':', ';', '@', '#', '~', '{', '[', '}', ']', ' ');
    $numberReplace = array('0' => 'o', '1' => 'i', '2' => 'z', '3' => 'e',
        '4' => 'a', '5' => 's', '6' => 'b', '7' => 'l', '8' => 'ate', '9' => 'g');

    // First pass: compact the phrase into one normalised string and search it
    $phrase = strtolower($phrase);
    $phraseStrip = str_replace($stripChars, '', $phrase);
    foreach ($numberReplace as $number => $letter)
    {
        // Array keys like '0' become integers in PHP, so cast back to a string
        $phraseStrip = str_replace ((string) $number, $letter, $phraseStrip);
    }
    foreach ($swearWords as $swearWord)
    {
        $pos = strpos($phraseStrip, $swearWord);
        if ($pos !== false) {
            //echo "The word '$swearWord' was found in the string\n";
            $isSwear = true;
        }
    }

    // Second pass: treat masking characters in each word as single-letter wildcards
    $phraseArray = explode(' ', $phrase);
    foreach ($phraseArray as $word)
    {
        // Repeated spaces leave empty strings, and an empty pattern would match everything
        if ($word === '') {
            continue;
        }
        $wordStrip = str_replace($stripChars, '*', $word);
        // Anchor with ^ and $ so a short word such as 'or' cannot match inside 'pork'
        $wordStrip = '(^'.str_replace('*', '[a-z]{1}', $wordStrip).'$)';
        foreach ($swearWords as $swearWord)
        {
            if (preg_match($wordStrip, $swearWord))
            {
                //echo 'Found "'.$swearWord.'"';
                $isSwear = true;
            }
        }
    }

    return $isSwear;
}
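The banned words are hard-coded in the function, but as mentioned earlier they could just as easily come from a flat file. One way that might look is sketched below, assuming a plain text file with one word per line; the helper name loadSwearWords and the file layout are assumptions for the example.

```php
<?php
// Hypothetical helper: reads one banned word per line from a text file.
function loadSwearWords(string $path): array
{
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return array(); // unreadable file: fall back to an empty list
    }

    // Trim stray whitespace and lowercase so entries match the normalised phrase.
    $words = array_map(function ($line) {
        return strtolower(trim($line));
    }, $lines);

    // Drop whitespace-only lines and duplicates, then reindex from 0.
    return array_values(array_unique(array_filter($words, 'strlen')));
}

// Demonstration with a temporary file standing in for a real word list.
$path = tempnam(sys_get_temp_dir(), 'swear');
file_put_contents($path, "Pork\nbeef\n\n  bacon \nbeef\n");
print_r(loadSwearWords($path)); // pork, beef, bacon
unlink($path);
```

The returned array could then take the place of the hard-coded $swearWords list.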