c# - with - remove invalid xml characters




Escape invalid XML characters in C# (4)

As the way to remove invalid XML characters I suggest you to use XmlConvert.IsXmlChar method. It was added since .NET Framework 4 and is presented in Silverlight too. Here is the small sample:

void Main() {
    string content = "\v\f\0";
    Console.WriteLine(IsValidXmlString(content)); // False

    content = RemoveInvalidXmlChars(content);
    Console.WriteLine(IsValidXmlString(content)); // True
}

static string RemoveInvalidXmlChars(string text) {
    var validXmlChars = text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray();
    return new string(validXmlChars);
}

static bool IsValidXmlString(string text) {
    try {
        XmlConvert.VerifyXmlChars(text);
        return true;
    } catch {
        return false;
    }
}

And as the way to escape invalid XML characters I suggest you to use XmlConvert.EncodeName method. Here is the small sample:

void Main() {
    const string content = "\v\f\0";
    Console.WriteLine(IsValidXmlString(content)); // False

    string encoded = XmlConvert.EncodeName(content);
    Console.WriteLine(IsValidXmlString(encoded)); // True

    string decoded = XmlConvert.DecodeName(encoded);
    Console.WriteLine(content == decoded); // True
}

static bool IsValidXmlString(string text) {
    try {
        XmlConvert.VerifyXmlChars(text);
        return true;
    } catch {
        return false;
    }
}

Update: It should be mentioned that the encoding operation produces a string with a length which is greater or equal than a length of a source string. It might be important when you store a encoded string in a database in a string column with length limitation and validate source string length in your app to fit data column limitation.

I have a string that contains invalid XML characters. How can I escape (or remove) invalid XML characters before I parse the string?


Here is an optimized version of the above method RemoveInvalidXmlChars which doesn't create a new array on every call, thus stressing the GC unnessesarily:

public static string RemoveInvalidXmlChars(string text)
    {
        if (text == null) return text;
        if (text.Length == 0) return text;

        // a bit complicated, but avoids memory usage if not necessary
        StringBuilder result = null;
        for (int i = 0; i < text.Length; i++)
        {
            var ch = text[i];
            if (XmlConvert.IsXmlChar(ch))
            {
                result?.Append(ch);
            }
            else
            {
                if (result == null)
                {
                    result = new StringBuilder();
                    result.Append(text.Substring(0, i));
                }
            }
        }

        if (result == null)
            return text; // no invalid xml chars detected - return original text
        else
            return result.ToString();

    }

The RemoveInvalidXmlChars method provided by Irishman does not support surrogate characters. To test it, use the following example:

static void Main()
{
    const string content = "\v\U00010330";

    string newContent = RemoveInvalidXmlChars(content);

    Console.WriteLine(newContent);
}

This returns an empty string but it shouldn't! It should return "\U00010330" because the character U+10330 is a valid XML character.

To support surrogate characters, I suggest using the following method:

public static string RemoveInvalidXmlChars(string text)
{
    if (string.IsNullOrEmpty(text))
        return text;

    int length = text.Length;
    StringBuilder stringBuilder = new StringBuilder(length);

    for (int i = 0; i < length; ++i)
    {
        if (XmlConvert.IsXmlChar(text[i]))
        {
            stringBuilder.Append(text[i]);
        }
        else if (i + 1 < length && XmlConvert.IsXmlSurrogatePair(text[i + 1], text[i]))
        {
            stringBuilder.Append(text[i]);
            stringBuilder.Append(text[i + 1]);
            ++i;
        }
    }

    return stringBuilder.ToString();
}

Use SecurityElement.Escape

using System;
using System.Security;

class Sample {
  static void Main() {
    string text = "Escape characters : < > & \" \'";
    string xmlText = SecurityElement.Escape(text);
//output:
//Escape characters : &lt; &gt; &amp; &quot; &apos;
    Console.WriteLine(xmlText);
  }
}




escaping