Posts tagged with “regex”

Claude.AI helped me improve the IsHtml check

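Let me re-analyze the problem:

HTML cases that need to be handled:

Simple tag: <div>text</div>
Tag with attributes: <a href="url">text</a>
Self-closing tag: <br/>
Self-closing tag with attributes: <img src="img.jpg"/>

Markdown link format that needs to be excluded:

Specifically the <https://...> form
Its distinguishing feature is that < is immediately followed by http or https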

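The problem with the current regex:

<\s*([a-zA-Z]+)[^>]*>.*</\s*\1\s*>|<\s*([a-zA-Z]+)[^>]*/>

It accepts any run of letters after < as a tag name. In <https://abc.com/> it therefore reads https as the tag name and the trailing /> as the end of a self-closing tag, which produces a false match.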

How to improve it:

Exclude tag names that start with http(s)
Use a negative lookahead, (?!https?)
Place it immediately before the tag-name group

The improved regular expression:

<\s*(?!https?)([a-zA-Z]+)[^>]*>.*</\s*\1\s*>|<\s*(?!https?)([a-zA-Z]+)[^>]*/>

Checking the result:

<div>text</div> ✓ matches
<a href="url">text</a> ✓ matches
<br/> ✓ matches
<img src="img.jpg"/> ✓ matches
<https://abc.com/> ✗ does not match

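This solution:

Keeps the basic structure of the original regex
Only adds a restriction on the tag name
Still handles the HTML tag forms listed above
Excludes Markdown-style URLs

Suggested code change: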

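// Requires: using System.Text.RegularExpressions;
// As an extension method, this must be declared inside a static class.
public static bool IsHtml(this string input)
{
    // (?!https?) rejects tag names starting with http, so <https://...> links are not treated as HTML.
    return Regex.IsMatch(input,
        @"<\s*(?!https?)([a-zA-Z]+)[^>]*>.*</\s*\1\s*>|<\s*(?!https?)([a-zA-Z]+)[^>]*/>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);
}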
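
To check the five cases above by running them, here is a minimal sketch that evaluates the old and the new pattern side by side. The class name IsHtmlCheck and the helpers OldIsHtml and NewIsHtml are made up for this illustration; only the two patterns and the RegexOptions come from the post.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical harness: the class and method names are illustrative only.
public static class IsHtmlCheck
{
    // Pattern before the change (no lookahead).
    const string OldPattern =
        @"<\s*([a-zA-Z]+)[^>]*>.*</\s*\1\s*>|<\s*([a-zA-Z]+)[^>]*/>";

    // Pattern after the change, with (?!https?) guarding each tag name.
    const string NewPattern =
        @"<\s*(?!https?)([a-zA-Z]+)[^>]*>.*</\s*\1\s*>|<\s*(?!https?)([a-zA-Z]+)[^>]*/>";

    const RegexOptions Options = RegexOptions.Singleline | RegexOptions.IgnoreCase;

    public static bool OldIsHtml(string input) => Regex.IsMatch(input, OldPattern, Options);

    public static bool NewIsHtml(string input) => Regex.IsMatch(input, NewPattern, Options);

    public static void Main()
    {
        string[] samples =
        {
            "<div>text</div>",
            "<a href=\"url\">text</a>",
            "<br/>",
            "<img src=\"img.jpg\"/>",
            "<https://abc.com/>",
        };

        // Print one row per sample so the two patterns can be compared.
        foreach (string s in samples)
        {
            Console.WriteLine($"{s,-26} old={OldIsHtml(s)} new={NewIsHtml(s)}");
        }
    }
}
```

Both patterns should accept the four HTML samples, and only the old one should accept <https://abc.com/>. Note that the lookahead rejects every tag name beginning with http, not only URLs; tightening it to (?!https?:) would be one way to narrow that, if it ever matters.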

C# inline regular expression syntax and usage

I found this brilliant answer on Stack Overflow. It is much clearer and more useful than Microsoft's official documentation.

You can use inline modifiers as follows:

// case-insensitive match
Regex MyRegex = new Regex(@"(?i)[a-z]+");

Or invert the meaning of the modifier by adding a minus sign:

// case-sensitive match
Regex MyRegex = new Regex(@"(?-i)[a-z]+");

Or switch them on and off within one pattern:

// case sensitive, then case-insensitive match
Regex MyRegex = new Regex(@"(?-i)[a-z]+(?i)[k-n]+");

Alternatively, you can use the mode-modifier span syntax using a colon : and a grouping parenthesis, which scopes the modifier to only that group:

// case sensitive, then case-insensitive match
Regex MyRegex = new Regex(@"(?-i:[a-z]+)(?i:[k-n]+)");

You can combine multiple modifiers in one go, like (?is-m:text), or write them one after another if you find that clearer: (?i)(?s)(?-m)text (I don't). When you use the on/off switching syntax, be aware that a modifier stays in effect until the next switch or the end of the regex. With mode-modifier spans, by contrast, the modifier applies only inside the group, and the previous behavior resumes after the span.
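
The difference in scope between the switch form and the span form is easy to see with a small test. This is a sketch with made-up names (InlineModifierDemo, SwitchMatches, SpanMatches) and made-up sample strings, not code from the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; names and inputs are illustrative only.
public static class InlineModifierDemo
{
    // (?i) as a switch: stays on until the next switch or the end of the pattern.
    public static bool SwitchMatches(string input) =>
        Regex.IsMatch(input, @"^(?i)abc def$");

    // (?i:...) as a span: applies only inside the group.
    public static bool SpanMatches(string input) =>
        Regex.IsMatch(input, @"^(?i:abc) def$");

    public static void Main()
    {
        Console.WriteLine(SwitchMatches("ABC DEF")); // True: both words are case-insensitive
        Console.WriteLine(SpanMatches("ABC def"));   // True: only "abc" is case-insensitive
        Console.WriteLine(SpanMatches("ABC DEF"));   // False: "def" is still case-sensitive
    }
}
```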

Finally: the allowed modifiers in .NET are (use a minus to invert the mode):

x allow whitespace and comments in the pattern
s single-line mode (. also matches a newline)
m multi-line mode (^ and $ also match at line breaks)
i case insensitivity
n only allow explicit (named) capture (.NET specific)
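
As a rough illustration of what the less familiar modifiers do, the sketch below exercises x, s, m and n on tiny inputs. The class name ModifierListDemo, the member names and the sample strings are assumptions made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; names and inputs are illustrative only.
public static class ModifierListDemo
{
    // x: whitespace in the pattern is ignored and # starts a comment.
    public static readonly Regex YearMonth = new Regex(@"(?x)
        ^ (\d{4})   # year
        -
        (\d{2}) $   # month
    ");

    // n: unnamed parentheses stop capturing; only named groups are kept.
    public static Match ExplicitCapture(string input) =>
        Regex.Match(input, @"(?n)(\d{4})-(?<month>\d{2})");

    public static void Main()
    {
        Console.WriteLine(YearMonth.IsMatch("2024-05"));          // True
        Console.WriteLine(Regex.IsMatch("a\nb", @"(?s)a.b"));     // True: s lets . match \n
        Console.WriteLine(Regex.IsMatch("a\nb", @"a.b"));         // False without s
        Console.WriteLine(Regex.IsMatch("a\nb", @"(?m)^b$"));     // True: m anchors at line breaks

        Match m = ExplicitCapture("2024-05");
        Console.WriteLine(m.Groups.Count);                        // 2: group 0 and "month"
        Console.WriteLine(m.Groups["month"].Value);               // 05
    }
}
```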

Alternative to Oracle Regular expression for word boundaries

You might already know that Oracle regular expressions don't support \b, but we still need word boundaries. A workaround is:

(^|\W)yourstring(\W|$)
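
The sketch below tries the same pattern in C# rather than in Oracle, purely to show how it behaves; it assumes \W, ^ and $ mean the same thing in both engines for these inputs. The class name WordBoundaryDemo and the helper Emulated are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; it runs the workaround pattern in .NET, not in Oracle.
public static class WordBoundaryDemo
{
    // Builds (^|\W)word(\W|$) with the word escaped for literal matching.
    public static string Emulated(string word) =>
        @"(^|\W)" + Regex.Escape(word) + @"(\W|$)";

    public static void Main()
    {
        string p = Emulated("cat");

        Console.WriteLine(Regex.IsMatch("a cat sat", p));   // True: whole word
        Console.WriteLine(Regex.IsMatch("cat", p));         // True: ^ and $ act as boundaries
        Console.WriteLine(Regex.IsMatch("concatenate", p)); // False: inside another word

        // Unlike \b, the workaround consumes the neighboring characters.
        Console.WriteLine("[" + Regex.Match("a cat sat", p).Value + "]"); // [ cat ]

        // So two occurrences that share one separator are found only once.
        Console.WriteLine(Regex.Matches("cat cat", p).Count);         // 1
        Console.WriteLine(Regex.Matches("cat cat", @"\bcat\b").Count); // 2
    }
}
```

For a plain yes/no test the workaround is equivalent to \b, but when extracting or replacing text, remember that the match includes the surrounding separator characters.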

Reference

Scribbles for my bad memory
