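Let me re-analyze the problem:
- HTML cases that must be handled:
  - Simple tag:
    <div>text</div>
  - Tag with attributes:
    <a href="url">text</a>
  - Self-closing tag:
    <br/>
  - Self-closing tag with attributes:
    <img src="img.jpg"/>
- Markdown link format that must be excluded:
  - Specifically the <https://...> form
  - Its signature is a < immediately followed by http or https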
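- Problem with the current regex:
  <\s*([a-zA-Z]+)[^>]*>.*</\s*\1\s*>|<\s*([a-zA-Z]+)[^>]*/>
  The tag-name group [a-zA-Z]+ accepts any run of letters after <, including http and https, so a URL such as <https://abc.com/> is matched by the self-closing alternative. That is the false positive.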
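- Approach to the fix:
  - Reject tag names that begin with http or https
  - Use a negative lookahead, (?!https?), written without an inner capture group so that the backreference \1 still points at the tag name
  - Place it immediately before the tag-name group
- Improved regex:
  <\s*(?!https?)([a-zA-Z]+)[^>]*>.*</\s*\1\s*>|<\s*(?!https?)([a-zA-Z]+)[^>]*/>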
- Verification:
  <div>text</div> ✓ match
  <a href="url">text</a> ✓ match
  <br/> ✓ match
  <img src="img.jpg"/> ✓ match
  <https://abc.com/> ✗ no match
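The cases above can be checked side by side. The following is a minimal sketch that runs both patterns with RegexOptions.Singleline | RegexOptions.IgnoreCase; the class and member names (PatternComparison, Original, Improved, Matches) are illustrative and not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternComparison
{
    // The pattern before the fix.
    public const string Original =
        @"<\s*([a-zA-Z]+)[^>]*>.*</\s*\1\s*>|<\s*([a-zA-Z]+)[^>]*/>";

    // The same pattern with a negative lookahead in front of each tag-name group.
    public const string Improved =
        @"<\s*(?!https?)([a-zA-Z]+)[^>]*>.*</\s*\1\s*>|<\s*(?!https?)([a-zA-Z]+)[^>]*/>";

    public static bool Matches(string pattern, string input) =>
        Regex.IsMatch(input, pattern, RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static void Main()
    {
        string[] samples =
        {
            "<div>text</div>",          // original=True, improved=True
            "<a href=\"url\">text</a>", // original=True, improved=True
            "<br/>",                    // original=True, improved=True
            "<img src=\"img.jpg\"/>",   // original=True, improved=True
            "<https://abc.com/>",       // original=True, improved=False
        };

        foreach (string sample in samples)
        {
            Console.WriteLine(
                $"{sample}: original={Matches(Original, sample)}, improved={Matches(Improved, sample)}");
        }
    }
}
```

Only the last sample changes its result, which is the intended effect of the lookahead. Note that neither pattern matches a bare <br> with no slash and no closing tag, because each alternative requires either a closing tag or a trailing />.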
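This solution:
- Keeps the basic structure of the original regex
- Only adds a restriction on the tag name
- Still handles the HTML tag forms listed above
- Excludes Markdown-style URLs
Suggested code change: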
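using System.Text.RegularExpressions;
public static class StringExtensions
{
    // Extension methods must live in a static class; the class name here is illustrative.
    // The (?!https?) lookahead rejects Markdown URLs such as <https://abc.com/>.
    public static bool IsHtml(this string input)
    {
        return Regex.IsMatch(input,
            @"<\s*(?!https?)([a-zA-Z]+)[^>]*>.*</\s*\1\s*>|<\s*(?!https?)([a-zA-Z]+)[^>]*/>",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }
}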
I found this brilliant answer on Stack Overflow. It is far clearer and more useful than Microsoft's official documentation.
You can use inline modifiers as follows:
// case insensitive match
Regex MyRegex = new Regex(@"(?i)[a-z]+");
or invert the meaning of the modifier by adding a minus sign:
// case sensitive match
Regex MyRegex = new Regex(@"(?-i)[a-z]+");
or, switch them on and off:
// case sensitive, then case-insensitive match
Regex MyRegex = new Regex(@"(?-i)[a-z]+(?i)[k-n]+");
Alternatively, you can use the mode-modifier span syntax using a colon : and a grouping parenthesis, which scopes the modifier to only that group:
// case sensitive, then case-insensitive match
Regex MyRegex = new Regex(@"(?-i:[a-z]+)(?i:[k-n]+)");
You can combine multiple modifiers in one go, like (?is-m:text), or write them one after another if you find that clearer, like (?i)(?s)(?-m)text (I don't). When you use the on/off switching syntax, be aware that a modifier stays in effect until the next switch or the end of the regex. Conversely, with mode-modified spans, the default behavior applies again after the span.
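The difference between the switch syntax and the span syntax can be sketched as follows. The inputs and the class name ModifierScopeDemo are made up for illustration; only Regex.IsMatch from System.Text.RegularExpressions is used.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ModifierScopeDemo
{
    public static void Main()
    {
        // Switch syntax: (?i) stays on until (?-i) turns it off again,
        // so "a" is case-insensitive and "b" is case-sensitive.
        Console.WriteLine(Regex.IsMatch("Ab", @"^(?i)a(?-i)b$")); // True
        Console.WriteLine(Regex.IsMatch("AB", @"^(?i)a(?-i)b$")); // False

        // Span syntax: the modifier applies only inside the group,
        // so "b" falls back to the default (case-sensitive) behavior.
        Console.WriteLine(Regex.IsMatch("Ab", @"^(?i:a)b$"));     // True
        Console.WriteLine(Regex.IsMatch("AB", @"^(?i:a)b$"));     // False

        // Several modifiers in one span: i and s switched on, m switched off.
        // With s on, the dot also matches the newline.
        Console.WriteLine(Regex.IsMatch("A\nB", @"^(?is-m:a.b)$")); // True
    }
}
```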
Finally: the allowed modifiers in .NET are (use a minus to invert the mode):
x allow whitespace and comments
s single-line mode
m multi-line mode
i case insensitivity
n only allow explicit capture (.NET specific)
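As a rough illustration of what each modifier in this list does, here is a small sketch with one check per modifier. The inputs, the group name year, and the class name ModifierListDemo are assumptions chosen for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ModifierListDemo
{
    public static void Main()
    {
        // x: whitespace in the pattern is ignored and # starts a comment.
        Console.WriteLine(Regex.IsMatch("42", @"(?x) ^ \d+ $ # digits only")); // True

        // s: the dot also matches a newline.
        Console.WriteLine(Regex.IsMatch("a\nb", @"a.b"));     // False
        Console.WriteLine(Regex.IsMatch("a\nb", @"(?s)a.b")); // True

        // m: ^ and $ match at line breaks, not only at the ends of the input.
        Console.WriteLine(Regex.IsMatch("a\nb", @"^b$"));     // False
        Console.WriteLine(Regex.IsMatch("a\nb", @"(?m)^b$")); // True

        // i: case-insensitive matching.
        Console.WriteLine(Regex.IsMatch("ABC", @"(?i)abc"));  // True

        // n: unnamed parentheses stop capturing; only named groups remain.
        Match match = Regex.Match("12-2024", @"(?n)(\d+)-(?<year>\d+)");
        Console.WriteLine(match.Groups.Count);         // 2 (group 0 plus "year")
        Console.WriteLine(match.Groups["year"].Value); // 2024
    }
}
```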
You might already know that Oracle regular expressions do not support \b. However, a word boundary is often needed. One workaround is:
(^|\W)yourstring(\W|$)
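The sketch below only illustrates how this pattern behaves, using .NET because the other examples in these notes are C#. In Oracle the pattern would instead be passed to a function such as REGEXP_LIKE, and engine details may differ. The word cat stands in for yourstring, and the class name is illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryEmulationDemo
{
    // "cat" is an arbitrary example word in place of "yourstring".
    private const string Emulated = @"(^|\W)cat(\W|$)";

    public static void Main()
    {
        Console.WriteLine(Regex.IsMatch("a cat sat", Emulated));   // True
        Console.WriteLine(Regex.IsMatch("cat", Emulated));         // True
        Console.WriteLine(Regex.IsMatch("concatenate", Emulated)); // False

        // Unlike \b, the emulation consumes the neighbouring non-word characters.
        Console.WriteLine("[" + Regex.Match("a cat sat", Emulated).Value + "]");   // [ cat ]
        Console.WriteLine("[" + Regex.Match("a cat sat", @"\bcat\b").Value + "]"); // [cat]
    }
}
```

Because the surrounding characters become part of the match, the word itself is easier to extract if it is wrapped in its own group.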