User Input. What modern app doesn't require some form of this? It's great if you can constrain the input to a fixed set of choices, like with a select box or radio buttons. However, a lot of the time you, as the developer, need to allow free-flowing text. In my experience this data nearly always ends up in a database (the one exception would be a search box input, unless you are capturing what the user is searching for, but that's another post to compare and contrast Google Search versus Duck Duck Go).

So, how do you make sure there is no maliciousness going on? It's not uncommon for someone to type a SQL statement into an input field, or tack one onto the end of a URL, and see what happens! The last thing you want as a developer is to have your database wiped out by something like ?name=drop%20*%20FROM%20ALL. Hope that backup tape is available or that it's stored in S3 rather than Glacier!

So, let's clean that input up. Let's Sanitize it.

Many, if not all, frameworks/languages have some function/class/method/add-on that will take a pretty good stab at it and protect you from the really easy bad code (XSS, SQL injection, etc.). In my experience the frameworks (e.g. Laravel) have a better solution for this than the languages (e.g. PHP). Pretty sure that JavaScript doesn't have anything out of the box (please correct me in the comments below).

As a good Toolsmith, one tends to write a lot of one-off code snippets that take care of an issue and are then discarded. Sometimes it's a Band-Aid that needs to be run every once in a while. These don't need to be clean. Don't need to be optimized. They just need to get the job done. ASAP. So, that's when I really don't like the overhead of a framework. I'd much rather pound out some PHP that interacts with the DB and quits. I still need to sanitize what is pulled from the DB before it's manipulated and put back into a (sometimes different) DB.

Other Band-Aids that I've written were needed by the dev team from time to time and included an input field that had to be sanitized. In those cases I still didn't want the overhead of a framework, but needed to protect the data.

Sorry, that was a bit of a long-winded way to say: RegEx to the rescue. JS has no built-in sanitizer, but there is a pretty common helper function that gets passed around, sanitizeString(str). And you know what's inside that function? That's right, a RegEx. Specifically,

function sanitizeString(str){
    str = str.replace(/[^a-z0-9áéíóúñü \.,_-]/gim,"");
    return str.trim();
}

What do we have here? First off, the RegEx sits between the delimiters /. There's a starting forward-slash and an ending forward-slash.

Immediately within those delimiters you'll find an open square-bracket [, some text, then the closing square-bracket ]. This is the character group to do your search on.

What's the first character in this group? The caret ^. As the first character inside the square-brackets it means "negate". In English, that'd be "Find everything that's not...." everything that comes after the caret and before the closing square-bracket.

Ok, so what are we accepting (the opposite of negating)? First up is the character range a-z, so, any letter of the alphabet (notice it's just lower case? The i flag, covered below, takes care of upper case). Next is any digit 0-9; we'll accept those. Now we have some common accented (non-ASCII) Latin characters áéíóúñü. We'll accept those in the input string as well. Finally, let's accept some special characters. First is the period .. Outside of a character group, the . has a special meaning in RegEx: it matches any character (except a line break), which is exactly what we are trying not to do. Inside square-brackets the period is already literal, so the backward-slash in \. isn't strictly required, but escaping it is harmless and makes the intent obvious. Ok, to summarize to this point: Accept all letters, even a few non-English standard ones, all digits and the period.
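To see the difference between a period outside and inside a character group, here is a small sketch. The variable names and sample strings are made up for illustration; only the behavior of the period is the point.

```javascript
// Outside a character group, an unescaped period matches any character
// except a line terminator.
const anyChar = /a.c/;
console.log(anyChar.test("abc")); // true
console.log(anyChar.test("a.c")); // true

// Inside a character group the period is already literal,
// so [^a-z.] and [^a-z\.] behave the same way.
const plainDot = /[^a-z.]/;
const escapedDot = /[^a-z\.]/;
console.log(plainDot.test("file.txt"));   // false: nothing outside a-z and "."
console.log(escapedDot.test("file.txt")); // false
console.log(plainDot.test("file!txt"));   // true: "!" is outside the group
```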

Next will be the comma ,, the underscore _, and finally the hyphen/dash -. It's important to put the hyphen at the end (or to escape it) so that it is not confused for a 'range separator'. In the digits 0-9 the hyphen says "zero through nine"; it does not say "zero, hyphen, nine", which would exclude 1 through 8. Get it?
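The effect of the hyphen's position can be checked with a quick sketch like the one below. The pattern names and sample strings are hypothetical, chosen only to show the two readings of the hyphen.

```javascript
// Hyphen last: it is a literal hyphen, so "-" is an accepted character.
const hyphenLast = /[^a-z0-9_-]/gi;
console.log("my-file_01!".replace(hyphenLast, "")); // "my-file_01"

// Hyphen between two characters: it defines a range.
const digitRange = /[^0-9]/g;     // "zero through nine"
const threeLiterals = /[^09-]/g;  // "zero", "nine" or "hyphen"
console.log("0123456789".replace(digitRange, ""));    // "0123456789"
console.log("0123456789".replace(threeLiterals, "")); // "09"
```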

That's the character group to search for. We now close it with the closing square-bracket ], then close the RegEx itself with the forward-slash /. Finally we round out the expression with some flags, gim, which are optional in general (but the first two are needed in this case). g=global, that is, search the entire input string str and don't stop at the first match. i=ignore case, so a is the same as A in the str. The last flag is m=multiline (from dev mozilla: "treat beginning and end characters (^ and $) as working over multiple lines, i.e., match the beginning or end of each line, delimited by \n or \r, not only the very beginning or end of the whole input string". Whew, but couldn't have typed it better myself.). Since this RegEx has no ^ or $ anchors (the caret inside the square-brackets means 'negate', not 'start of line'), the m flag has no effect here and could be dropped.
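A short sketch of what each flag contributes, using a cut-down character group and a made-up input string so the differences are easy to spot:

```javascript
const input = "Hello, World!";

// No flags: only the first disallowed character (the upper-case "H") is removed.
console.log(input.replace(/[^a-z ,]/, ""));    // "ello, World!"

// g only: every disallowed character goes, including the capitals.
console.log(input.replace(/[^a-z ,]/g, ""));   // "ello, orld"

// g and i: upper-case letters are now accepted as well.
console.log(input.replace(/[^a-z ,]/gi, ""));  // "Hello, World"

// Adding m changes nothing, because the pattern has no ^ or $ anchors.
console.log(input.replace(/[^a-z ,]/gim, "")); // "Hello, World"
```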

Now, you'll see another comma followed by open and closing double-quotes ,"". This is the second argument to the replace() method. The first is what to search for and match, and the second is what to replace it with, in this case an empty string. Why empty? Recall that we are trying to sanitize, or get rid of any bad characters. Notice that the question-mark ? or the asterisk * aren't in the RegEx? That's because those have special meaning in the SQL language (the same goes for quotes and semicolons, which aren't in the RegEx either). So, this RegEx says, if you see a ? in the input str, just replace it with "", which effectively just removes it.

Finally, finally, finally, this JS function returns the sanitized string with any leading or trailing whitespace removed by trim().
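Putting the walkthrough together, here is the function run against a few sample strings. The inputs are invented for illustration; the function body is the one discussed above.

```javascript
function sanitizeString(str) {
    // Remove every character that is not in the accepted group.
    str = str.replace(/[^a-z0-9áéíóúñü \.,_-]/gim, "");
    return str.trim();
}

// Accepted characters pass through; surrounding whitespace is trimmed.
console.log(sanitizeString("  José Peña, Jr.  "));          // "José Peña, Jr."

// Angle brackets, parentheses and slashes are stripped.
console.log(sanitizeString("<script>alert(1)</script>"));   // "scriptalert1script"

// Quotes and semicolons are stripped; the words and hyphens remain.
console.log(sanitizeString("name'; DROP TABLE users; --")); // "name DROP TABLE users --"
```

Note that the function only removes characters. The remaining words are left as they are, so the output is plain text rather than a verdict on whether the input was well-intentioned.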

PHP Customization

Ok, that JS function works pretty well. I've developed my own function to run user input against when I'm writing PHP toolsmithing scripts. What I've done is make the RegEx a variable. Sometimes I want to sanitize a name, other times an email or zip code. So in addition to passing in the string to be tested/sanitized, I pass in the RegEx I want to use.

In this function, if the input value $input_value has any bad characters, as defined by the RegEx $reg_exp, the function returns false, thus failing the test and an error message can be thrown.

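// Returns true when $input_value contains none of the characters
// matched by $reg_exp (a negated character group), false otherwise.
function test_datatype($input_value, $reg_exp) {
    if (preg_match($reg_exp, $input_value)) {
        return false;
    }
    return true;
}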

Let's give a real-world example:

$firstNameOK = strlen($_POST['first_name']) > 0 && test_datatype($_POST['first_name'],"/[^A-Za-z0-9 !\-']/") ? $_POST['first_name'] :  $_POST['error-first_name'] = '<error>Please provide your First name.';

This first makes sure that there is some input, thus the strlen() > 0. If not, it fails and the error is given as the value of the variable $firstNameOK. If the string length is greater than zero, let's test it against the provided RegEx /[^A-Za-z0-9 !\-']/. So, quickly as review, the forward-slashes delimit the RegEx, and the square-brackets denote the character group to match/search for in the input text. So, again, we are matching (and therefore rejecting) anything that's not a letter in the English alphabet A-Za-z (this is an alternate way to get both upper and lower case without the use of a flag). We also will allow any digit 0-9 (which can alternately be written as \d, an escaped 'd' meaning not the letter d, but 'digit'), then there's a literal space, which means a plain space is ok. We don't want \s here, which would also match tabs, returns, etc.

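The next part is new: !\-'. Inside a character group the exclamation-point has no special meaning; unlike the ^ at the start of the group, it doesn't mean 'not'. It's simply a literal !, so this RegEx accepts an exclamation-point in a name (drop it from the group if you don't want that). The \- is an escaped hyphen; escaping it means it is taken literally and not as a 'range separator', even though it isn't the last character in the group. The final character is the apostrophe '. In plain English, this allows a hyphenated name like 'Smith-Jones' or a name like O'Brien.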
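How this character group treats each of those characters can be checked directly. The sketch below uses JavaScript rather than PHP (this simple group is interpreted the same way by both engines), and the sample names are made up.

```javascript
const namePattern = /[^A-Za-z0-9 !\-']/;

// test() returns true when a character outside the group is found,
// i.e. when the input should be rejected.
console.log(namePattern.test("Smith-Jones")); // false: the hyphen is accepted
console.log(namePattern.test("O'Brien"));     // false: the apostrophe is accepted
console.log(namePattern.test("Wow!"));        // false: "!" is a literal, accepted character
console.log(namePattern.test("Smith_Jones")); // true: the underscore is not in the group
```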

If the input string fails anywhere it fails totally, unlike the JS function above that just removes the offending character.
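The contrast between the two approaches, rejecting the whole value versus stripping the offending characters, can be sketched side by side. The helper names isValid and strip are hypothetical and written in JavaScript for comparison; they are not part of the PHP script.

```javascript
const namePattern = /[^A-Za-z0-9 !\-']/;

// Validation: reject the whole value if it is empty or contains
// any character outside the accepted group.
function isValid(value) {
    return value.length > 0 && !namePattern.test(value);
}

// Sanitizing: keep the value, but strip every disallowed character.
// A global copy of the pattern is built so that all matches are removed.
function strip(value) {
    return value.replace(new RegExp(namePattern.source, "g"), "").trim();
}

console.log(isValid("Anne-Marie")); // true
console.log(isValid("Anne<b>"));    // false: the value is rejected outright
console.log(strip("Anne<b>"));      // "Anneb": the value is kept, minus "<" and ">"
```

The pattern used for test() deliberately has no g flag: a global RegEx remembers where its last match ended, which can make repeated test() calls on the same object give surprising results.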

$stateOK = strlen($_POST['state']) >0 && test_datatype($_POST['state'],"/[^A-Za-z]/") ? $_POST['state'] : $_POST['error-state'] = '<error>Please select the state of residency.';

Here we only accept letters for the state, with no spaces, digits or punctuation.

$zipOK = strlen($_POST['zip']) > 0 && test_datatype($_POST['zip'],"/[^0-9 \-]/") ? $_POST['zip']: $_POST["error-zip"] = '<error>Please provide an alpha-numeric input for your Zip/Postal Code';

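Here we only want digits, spaces and hyphens. (Note that the error message asks for an 'alpha-numeric' input, but this RegEx rejects letters, so a postal code that contains letters would fail the test.)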

Hopefully you can see the benefit of passing a customized, specific RegEx when you don't have any frameworks or libraries to validate input data.
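One way to organize per-field patterns is a small rule table. This is only a sketch in JavaScript: the rules object and validateForm function are hypothetical, the patterns are the ones from the PHP examples, and the error messages are adapted from them.

```javascript
// Hypothetical rule table: field name -> pattern of disallowed characters
// plus the message to show when the check fails.
const rules = {
    first_name: { bad: /[^A-Za-z0-9 !\-']/, error: "Please provide your First name." },
    state:      { bad: /[^A-Za-z]/,         error: "Please select the state of residency." },
    zip:        { bad: /[^0-9 \-]/,         error: "Please provide your Zip/Postal Code." },
};

// Returns an object holding one error message per failing field.
// Field values are assumed to be strings.
function validateForm(form) {
    const errors = {};
    for (const [field, rule] of Object.entries(rules)) {
        const value = form[field] ?? "";
        if (value.length === 0 || rule.bad.test(value)) {
            errors[field] = rule.error;
        }
    }
    return errors;
}

// All three fields pass, so the result is an empty object.
console.log(validateForm({ first_name: "Anne-Marie", state: "MN", zip: "55401-1234" }));

// All three fields fail: markup in the name, empty state, letters in the zip.
console.log(validateForm({ first_name: "Anne<b>", state: "", zip: "K1A 0B1" }));
```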

Bonus

ES6 JavaScript.

function sanitizeString(str){
    str = str.replace(/[^a-z0-9áéíóúñü \.,_-]/gim,"");
    return str.trim();
}

Let's convert the above sanitizeString(str) JS function to ES6.

let safeStr = (str) => str.replace(/[^a-z0-9áéíóúñü \.,_-]/gim,"").trim();

A bit more concise, no? 'Let' the variable safeStr equal a function, denoted by the fat-arrow =>, that takes the input (str) and implicitly returns the result of str.replace(), trimmed with trim() just like the original.
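A quick way to convince yourself that the two versions agree is to run them on the same input. In this sketch the arrow function keeps the trim() call so that it matches the original function exactly; the sample string is made up.

```javascript
function sanitizeString(str) {
    str = str.replace(/[^a-z0-9áéíóúñü \.,_-]/gim, "");
    return str.trim();
}

// Arrow-function version, including the trim() call.
const safeStr = (str) => str.replace(/[^a-z0-9áéíóúñü \.,_-]/gim, "").trim();

const sample = "  <b>Hola, niño!</b>  ";
console.log(sanitizeString(sample));                     // "bHola, niñob"
console.log(safeStr(sample) === sanitizeString(sample)); // true
```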