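Let me re-analyze the problem:
- HTML cases to cover:
  - Simple tag: <div>text</div>
  - Tag with attributes: <a href="url">text</a>
  - Self-closing tag: <br/>
  - Self-closing tag with attributes: <img src="img.jpg"/>
- Markdown link format to exclude:
  - Specifically the autolink form <https://...>
  - Its distinguishing feature is that < is immediately followed by http or https
- Problem with the current regex: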
<\s*([a-zA-Z]+)[^>]*>.*</\s*\1\s*>|<\s*([a-zA-Z]+)[^>]*/>
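The tag-name group [a-zA-Z]+ accepts any run of letters after <, so the second alternative reads <https://abc.com/> as a self-closing tag named https. That is the false match.
- Improvement:
  - Reject tag names that begin with http or https
  - Use a negative lookahead, (?!https?)
  - Place it immediately before the tag-name group
- Improved regular expression: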
<\s*(?!https?)([a-zA-Z]+)[^>]*>.*</\s*\1\s*>|<\s*(?!https?)([a-zA-Z]+)[^>]*/>
- Verification:
  - <div>text</div>: ✓ matches
  - <a href="url">text</a>: ✓ matches
  - <br/>: ✓ matches
  - <img src="img.jpg"/>: ✓ matches
  - <https://abc.com/>: ✗ does not match
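This solution:
- Keeps the basic structure of the original regex
- Adds only a restriction on the tag name
- Still matches the four HTML tag forms listed above
- Excludes Markdown autolink URLs of the form <https://...>
Suggested code change: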
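// Returns true if the input contains an HTML tag; Markdown autolinks such as <https://...> are ignored.
public static bool IsHtml(this string input)
{
    return Regex.IsMatch(input,
        @"<\s*(?!https?)([a-zA-Z]+)[^>]*>.*</\s*\1\s*>|<\s*(?!https?)([a-zA-Z]+)[^>]*/>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);
}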
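The verification list above can be reproduced with a small console program. This is only a minimal sketch: the class names StringExtensions and Demo are placeholders chosen for illustration (an extension method has to live in some static class), and the samples are the five strings from the list.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical host class; extension methods must be declared in a static class.
public static class StringExtensions
{
    public static bool IsHtml(this string input)
    {
        return Regex.IsMatch(input,
            @"<\s*(?!https?)([a-zA-Z]+)[^>]*>.*</\s*\1\s*>|<\s*(?!https?)([a-zA-Z]+)[^>]*/>",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }
}

public static class Demo
{
    public static void Main()
    {
        string[] samples =
        {
            "<div>text</div>",
            "<a href=\"url\">text</a>",
            "<br/>",
            "<img src=\"img.jpg\"/>",
            "<https://abc.com/>",
        };

        foreach (string sample in samples)
        {
            // Prints each sample followed by True or False.
            Console.WriteLine($"{sample} -> {sample.IsHtml()}");
        }
    }
}
```

The first four samples should print True and the autolink should print False, which matches the ✓/✗ results listed above.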
I found this brilliant answer on Stack Overflow. It is much clearer and more useful than Microsoft's official documentation.
You can use inline modifiers as follows:
// case-insensitive match
Regex MyRegex = new Regex(@"(?i)[a-z]+");
or invert the meaning of the modifier by adding a minus sign:
// case-sensitive match
Regex MyRegex = new Regex(@"(?-i)[a-z]+");
or, switch them on and off:
// case sensitive, then case-insensitive match
Regex MyRegex = new Regex(@"(?-i)[a-z]+(?i)[k-n]+");
Alternatively, you can use the mode-modifier span syntax, which uses a colon (:) and a grouping parenthesis and scopes the modifier to only that group:
// case sensitive, then case-insensitive match
Regex MyRegex = new Regex(@"(?-i:[a-z]+)(?i:[k-n]+)");
You can combine multiple modifiers in one go, like (?is-m:text), or write them one after another if you find that clearer, like (?i)(?s)(?-m)text (I don't). When you use the on/off switching syntax, be aware that the modifier stays in effect until the next switch or the end of the regex. With mode-modified spans, by contrast, the default behavior applies again after the span.
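The difference between the switch syntax and the span syntax is easiest to see on a two-letter pattern. The following is a small sketch, not part of the quoted answer; the class and method names are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ModifierScopeDemo
{
    // Span syntax: only the "a" inside (?i:...) ignores case; "b" stays case-sensitive.
    public static bool SpanOnly(string input) =>
        Regex.IsMatch(input, @"^(?i:a)b$");

    // Switch syntax: (?i) stays on for everything that follows it.
    public static bool SwitchOn(string input) =>
        Regex.IsMatch(input, @"^(?i)ab$");

    public static void Main()
    {
        Console.WriteLine(SpanOnly("Ab"));  // True: "A" is inside the span
        Console.WriteLine(SpanOnly("AB"));  // False: "B" is outside the span
        Console.WriteLine(SwitchOn("AB"));  // True: the switch covers both letters
    }
}
```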
Finally, the allowed modifiers in .NET are (use a minus to invert the mode):
x    allow whitespace and comments
s    single-line mode
m    multi-line mode
i    case insensitivity
n    only allow explicit capture (.NET specific)
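As a rough illustration of the less familiar modifiers in that list, here is a short sketch that exercises x, s, m and n through the inline syntax. The class name, method names and sample inputs are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ModifierListDemo
{
    // x: unescaped whitespace in the pattern is ignored and # starts a comment.
    public static bool Extended(string input) =>
        Regex.IsMatch(input, @"(?x) ^ \d+ $   # one or more digits");

    // s: the dot also matches a newline.
    public static bool SingleLine(string input) =>
        Regex.IsMatch(input, @"(?s)^a.b$");

    // m: ^ and $ match at the start and end of every line.
    public static int MultiLineCount(string input) =>
        Regex.Matches(input, @"(?m)^\w$").Count;

    // n: unnamed parentheses stop capturing; only named groups are kept.
    public static Match ExplicitCapture(string input) =>
        Regex.Match(input, @"(?n)(\d+)-(?<word>[a-z]+)");

    public static void Main()
    {
        Console.WriteLine(Extended("42"));            // True
        Console.WriteLine(SingleLine("a\nb"));        // True
        Console.WriteLine(MultiLineCount("x\ny"));    // 2

        Match m = ExplicitCapture("12-ab");
        Console.WriteLine(m.Groups.Count);            // 2: the whole match plus "word"
        Console.WriteLine(m.Groups["word"].Value);    // ab
    }
}
```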
You might already know that Oracle regular expressions don't support the word-boundary anchor \b. We still need it, so the workaround is:
(^|\W)yourstring(\W|$)
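In Oracle the pattern would be passed to a function such as REGEXP_LIKE. The sketch below only checks the logic of the pattern in .NET, on the assumption that ^, $ and \W behave the same way for these inputs; the WholeWord helper and the sample word "cat" are illustrative and not from the original note.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Builds (^|\W)word(\W|$); illustrative helper, not an Oracle API.
    public static string WholeWord(string word) =>
        @"(^|\W)" + Regex.Escape(word) + @"(\W|$)";

    public static void Main()
    {
        string pattern = WholeWord("cat");

        Console.WriteLine(Regex.IsMatch("a cat sat", pattern));    // True
        Console.WriteLine(Regex.IsMatch("cat", pattern));          // True
        Console.WriteLine(Regex.IsMatch("concatenate", pattern));  // False

        // Unlike \b, the workaround consumes the neighbouring characters:
        // the match here is " cat " (5 characters), not "cat".
        Console.WriteLine(Regex.Match("a cat sat", pattern).Value.Length);
    }
}
```

Because the surrounding non-word characters are part of the match, this works well for existence tests, but a replace operation has to put those characters back, and two occurrences separated by a single character (as in "cat cat") are found only once when matches are enumerated.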