Posts in category “Programming”

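A Silent "Fake Death": When a Background Task Quietly Stopped Right Under Our Noses

In software engineering, the bugs we fear most are not the "noisy" ones that produce stack traces and crash the system, but the "silent" assassins. They quietly break your system's functionality without leaving a trace: no error logs, no CPU spikes, and even the health checks stay green.

Recently, we ran into exactly such a "perfect criminal".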

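The Crime Scene

We have a .NET Core background service that runs as an IHostedService and continuously pulls messages from an AWS SQS queue for processing. After a routine dependency upgrade, the service started behaving strangely:

After startup, it successfully processed the first batch of messages. Then it "died".

It stopped pulling new messages from the queue, yet the container kept running and the health check endpoint kept returning 200 OK. Most puzzling of all, the log dashboard was completely quiet, with no exceptions or warnings. The service was like a zombie in deep sleep.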

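A Murky Investigation

Faced with this "silent fake death", our team immediately convened a "case review" and drew up a list of plausible "suspects":

  1. API throttling: We had just refactored QueueService and removed the queue URL cache. Was every poll now calling GetQueueUrl and getting throttled by the AWS API?
  2. Network blocking/deadlock: The new AWS SDK might behave differently. Was a long poll hanging forever during a network hiccup, with no way to cancel it because we were not passing a CancellationToken?
  3. Tight error loop: Was something throwing continuously, with a catch block that caught the exception but had no backoff, so the background thread spun at full speed and overwhelmed the logging system?

These were all reasonable hypotheses, and any of them could have produced the symptoms we saw. We spent a lot of time reviewing code and analyzing theories, and even prepared elaborate fixes, such as reimplementing a concurrent cache guarded by SemaphoreSlim and adding exponential backoff.

However, every one of our hypotheses was wrong.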

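The Truth Revealed: A Tragedy Caused by a null

The real culprit was hiding somewhere we never expected, unremarkable and even a little laughable. It was not a complex cloud service interaction problem but a basic C# null-handling bug.

Our QueueProcessorService contains this logic: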

// _StartQueueProcessingAsync() in QueueProcessorService
List<Message> messageList = new List<Message>();
try
{
    // Call the refactored QueueService
    messageList = await _queueService.ReceiveMessageAsync(request);
}
catch (Exception e)
{
    _logger.Error(e, e.Message);
}
finally
{
    // If messageList contains messages, process them
    if (messageList.Any()) // <-- the fatal line
    {
        var processingTasks = messageList.Select(ProcessMessageAsync).ToArray();
        await Task.WhenAll(processingTasks);
    }
    else
    {
        await Task.Delay(500);
    }
}

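Where did it go wrong?

After we upgraded the AWSSDK.SQS library, the behavior of _amazonSqs.ReceiveMessageAsync() changed in a small but fatal, breaking way: when the queue is empty, the Messages property of the returned ReceiveMessageResponse is no longer an empty list [], but null.

Before the fix, our QueueService passed this null straight back to the caller. As a result, messageList in QueueProcessorService was assigned null whenever the queue was empty.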

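Execution then enters the finally block and runs messageList.Any(). Calling it on a null reference can only end one way: an exception. (Strictly speaking, Any() is a LINQ extension method, so the exception thrown for a null source is an ArgumentNullException rather than a NullReferenceException, but the effect on the loop is the same.)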

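The Accomplice: The "Forgotten" Background Task

A single exception like this is fatal enough, but how did it manage to be so silent? That brings us to the "accomplice" in this case: the way we started the background task.

In the IHostedService's StartAsync method, we started the main loop like this: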

public Task StartAsync(CancellationToken cancellationToken)
{
    _logger.Info("Service is running...");
    // Fire-and-forget start
    _queueProcessingTask = _StartQueueProcessingAsync();
    return Task.CompletedTask;
}

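This fire-and-forget pattern carries a serious hidden risk: if _queueProcessingTask later fails (becomes Faulted) because of an unhandled exception, nothing awaits or observes the task, so the exception is never propagated and is silently swallowed.

Our exception was thrown inside a finally block that had no try-catch protection of its own, so it went unhandled and killed the _StartQueueProcessingAsync task outright. The rest of the program knew nothing about it and carried on as if everything were fine.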

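Lessons Learned

This painful debugging session left us with several lasting lessons:

  1. Beware of "minor" changes in third-party libraries: Never assume that a minor version upgrade of a dependency is completely harmless. The difference between null and [] is enough to bring a robust system down in an instant. Reading the changelog carefully is essential.

  2. Practice defensive programming: Never fully trust a method's return value. For any method that may return a collection, be prepared for it to return null. A simple ?? [] can save the day.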

    // The fix
    var response = await _amazonSqs.ReceiveMessageAsync(request, cancellationToken);
    return response.Messages ?? []; // Always return a valid list
    
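  3. Never "forget" your background tasks: A fire-and-forget background task needs a "lookout". The simplest approach is to wrap it in a try-catch where it is started, so that any fatal exception is at least logged.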

    // A more robust way to start
    public Task StartAsync(CancellationToken cancellationToken)
    {
        _cancellationTokenSource = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        _queueProcessingTask = Task.Run(async () =>
        {
            try
            {
                await _StartQueueProcessingAsync(_cancellationTokenSource.Token);
            }
            catch (Exception ex)
            {
                // Log the fatal error so the problem surfaces immediately
                _logger.Fatal(ex, "The queue processing task has crashed unexpectedly.");
            }
        }, _cancellationTokenSource.Token);
    
        return Task.CompletedTask;
    }
    

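This experience reminded us that the most dangerous bugs are often not complex algorithmic or architectural problems, but the result of a chain of small oversights and surprises. Staying humble and writing robust, predictable code is our best weapon against these "silent assassins".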

Desktop Mouse Swipe Delete Troubleshooting

Problem

Mouse left swipe to delete notes was not working on desktop platforms (Windows, macOS, Linux, Web browsers).

Root Cause Analysis

  1. Flutter Dismissible Limitations: Dismissible widget is optimized for touch interactions, not mouse gestures
  2. SelectionArea Gesture Conflict: SelectionArea intercepted mouse drag events for text selection, blocking Dismissible swipe gestures

Solution

Initial Approach (Failed)

Attempted platform-specific custom GestureDetector implementation:

  • Added desktop platform detection logic
  • Implemented custom mouse drag threshold calculations
  • Issue: GestureDetector only detects gestures but provides no visual feedback

Final Solution (Working)

Two-part fix for Dismissible widget:

  1. Immediate Mouse Response
Dismissible(
  dragStartBehavior: DragStartBehavior.down, // Key fix
  // ... other properties
)
  2. Gesture Conflict Resolution
// Before: SelectionArea wrapping entire content
SelectionArea(
  child: GestureDetector(...) // Blocked swipe gestures
)

// After: SelectionArea only around text content
GestureDetector(
  child: Column(
    children: [
      metadata,
      SelectionArea(child: noteContent), // Limited scope
      footer,
    ],
  ),
)

Technical Details

dragStartBehavior Impact

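  • DragStartBehavior.start (default): The drag is measured from the point where the gesture is recognized, after the pointer has moved past the drag threshold
  • DragStartBehavior.down: The drag is measured from the original pointer-down position, so the item follows the mouse from where the button was pressed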
  • Desktop users expect immediate response to mouse actions

SelectionArea Scope Reduction

  • Problem: Full-content SelectionArea captured all mouse events
  • Solution: Limit SelectionArea to text content only
  • Result: Swipe gestures work on margins, text selection works on content

Testing Strategy

Created comprehensive test suite covering:

  • Dismissible configuration validation
  • Platform-specific behavior
  • Gesture conflict scenarios
  • Integration with existing callbacks

Key Learnings

  1. Flutter's Dismissible supports desktop with proper configuration
  2. Widget event hierarchies can cause unexpected gesture conflicts
  3. Scope reduction often beats complex custom implementations
  4. Platform-specific UX requires careful gesture management

Code Impact

  • Files Modified: note_list_item.dart
  • Tests Added: note_list_item_test.dart
  • Lines Changed: 30 additions, 15 deletions
  • Breaking Changes: None

Verification

  • All tests pass
  • Code analysis clean
  • Desktop mouse swipe delete functional
  • Text selection preserved

Flutter iOS Safari Double Selection Issue - Technical Workaround

Problem

Flutter web apps on iOS Safari exhibit a "double selection" bug where text selection creates two overlapping selection layers, causing visual artifacts and interaction issues.

Root Cause

iOS Safari creates both native browser text selection AND Flutter's custom SelectionArea selection simultaneously, resulting in conflicting selection states.

Working Workaround

HTML Solution (web/index.html)

<head>
<style>
  * {
    -webkit-user-select: none;
    -moz-user-select: none;
    -ms-user-select: none;
    user-select: none;
    /* Disable caret to prevent selection artifacts */
    caret-color: rgba(255, 255, 255, 0) !important;
  }
</style>
</head>
<body oncontextmenu="event.preventDefault();" >
...

Flutter Integration

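// In main.dart or app initialization
import 'package:flutter/foundation.dart'; // kIsWeb
import 'package:flutter/material.dart'; // runApp, WidgetsFlutterBinding
import 'package:flutter/services.dart'; // BrowserContextMenu

void main() {
    // The binding must exist before calling into platform channels
    WidgetsFlutterBinding.ensureInitialized();
    // Disable browser context menu to let Flutter handle selection
    if (kIsWeb) {
        BrowserContextMenu.disableContextMenu();
    }
    runApp(MyApp());
}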

How It Works

  1. Disables native selection - user-select: none prevents Safari's text selection
  2. Blocks the browser context menu - The oncontextmenu handler on <body> cancels the browser's default context menu
  3. Maintains Flutter selection - BrowserContextMenu.disableContextMenu() allows Flutter's SelectionArea to work
  4. Hides caret artifacts - A fully transparent caret-color eliminates visual glitches

Trade-offs

  • ❌ Breaks native web text selection outside Flutter widgets
  • ❌ May affect accessibility tools
  • ✅ Provides consistent cross-platform selection behavior
  • ✅ Eliminates iOS-specific double selection bug

Status

This workaround is recommended for Flutter web apps requiring text selection on iOS Safari until the official framework fix is released.

Solving Flutter Web Image Rendering Issues: A Cross-Platform Approach

If you've ever tried to display images in a Flutter app that needs to work seamlessly across web and mobile platforms, you've probably run into some frustrating limitations. Recently, I tackled this exact problem when our markdown image renderer started choking on web deployments due to CORS restrictions and lack of proper zoom functionality.

The Problem

Our original implementation was painfully simple - just Image.network(src) wrapped in a GestureDetector. This worked fine on mobile and on web with --web-renderer html. However, the HTML web renderer is deprecated and slated for removal, so we needed a more reliable solution.

The Solution: Platform-Specific Implementations

The key insight was to leverage Flutter's conditional imports to create platform-specific implementations while maintaining a clean, unified API.

Setting Up Conditional Imports

// Conditional imports for web
import 'web_image_stub.dart'
    if (dart.library.html) 'web_image_impl.dart';

This pattern lets you have different implementations for web vs mobile while keeping your main code clean. The stub file handles non-web platforms, while the implementation file contains the web-specific logic.

Web Implementation: HTML Elements to the Rescue

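For web, I ditched Flutter's built-in image widgets entirely and went straight to HTML elements using HtmlElementView. This sidesteps the CORS issues because the browser loads and displays the image itself through an <img> element, which does not require CORS headers, whereas Flutter's canvas-based renderer has to read the image bytes and is therefore subject to CORS.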

import 'dart:html' as html;

// Plain <img> element; the browser fetches and renders it directly
final imgElement = html.ImageElement()
  ..src = src
  ..style.width = '100%'
  ..style.objectFit = 'contain'
  ..style.cursor = 'pointer';

The magic happens when you register this as a platform view. Flutter treats it like any other widget, but under the hood, it's pure HTML - which means it plays nicely with browser security policies.

Adding Zoom Functionality

The fullscreen implementation includes both mouse wheel and touch gesture zoom:

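// Mouse wheel zoom
double scale = 1.0; // current zoom factor
imgElement.onWheel.listen((event) {
  event.preventDefault();
  // Scrolling down zooms out, scrolling up zooms in
  final delta = event.deltaY > 0 ? -0.1 : 0.1;
  // clamp() returns num, so convert back to double
  scale = (scale + delta).clamp(0.5, 3.0).toDouble();
  imgElement.style.transform = 'scale($scale)';
});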

For touch devices, I implemented pinch-to-zoom by tracking multiple touch points and calculating the distance between them. It's more complex than the mouse wheel version, but it gives web users the same intuitive zoom experience they expect.

Mobile Implementation: Keep It Simple

For mobile platforms, I stuck with the tried-and-true approach but improved the UX:

Dialog.fullscreen(
  backgroundColor: Colors.black,
  child: Stack(
    children: [
      Center(
        child: PhotoView(
          imageProvider: NetworkImage(url),
          minScale: PhotoViewComputedScale.contained,
          maxScale: PhotoViewComputedScale.covered * 4,
        ),
      ),
      // Close button positioned in top-right
    ],
  ),
)

The key improvements were switching to Dialog.fullscreen instead of a regular dialog and adding a proper close button with consistent styling.

Key Takeaways

  1. Conditional imports are your friend - They let you maintain clean separation between platform-specific code without cluttering your main logic.

  2. HTML elements can solve web-specific problems - When Flutter widgets don't cut it on web, dropping down to HTML often provides better browser compatibility.

  3. Consistent UX matters - Users expect zoom functionality on images, especially in fullscreen mode. Don't skimp on these details.

  4. Don't fight the platform - Web and mobile have different strengths. Embrace them instead of trying to force a one-size-fits-all solution.

The Result

After implementing these changes, our image handling works consistently across platforms. Web users get smooth zoom functionality without CORS headaches, mobile users get the native experience they expect, and the codebase remains maintainable with clear separation of concerns.

Sometimes the best solution isn't the most elegant one - it's the one that actually works for your users across all the platforms they're using.

Introducing changesummary: A Git Change Summary Script

The changesummary script helps developers quickly understand the key changes made between two Git commits. It sends the diff to a language model through the OpenRouter API and prints a concise summary of the modifications.

Functionality

The script takes one or two arguments: the start commit hash and an optional end commit hash. If the end commit hash is not provided, it defaults to HEAD.
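
The default for the second argument can be expressed with shell parameter expansion. A minimal sketch, where resolve_range is a hypothetical helper used only for illustration (the script itself assigns start_hash and end_hash directly):

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): print the commit
# range that would be diffed for the given arguments.
resolve_range() {
    start="$1"
    end="${2:-HEAD}"   # fall back to HEAD when no second argument is given
    printf '%s..%s\n' "$start" "$end"
}

resolve_range abc123          # prints abc123..HEAD
resolve_range abc123 def456   # prints abc123..def456
```

The ${2:-HEAD} form substitutes HEAD both when the second argument is missing and when it is an empty string.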

Usage

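To use changesummary, run it from a Bash shell (for example Git Bash) inside the repository. The script needs git, jq, and curl on the PATH, and reads the OpenRouter API key from the OR_FOR_CI_API_KEY environment variable: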

./changesummary <start_commit_hash> [<end_commit_hash>]

Example

./changesummary abc123 def456

Result:

The key changes between the specified commit hashes are:

* Renamed `TrmWithRiaSettlementPricingTask` to `TrmWithRiaDetailsPricingTask` and refactored its logic into a base class `TrmConfigPricingTaskBase`.
* Removed `RiaRawRateDataDecorator` and updated `TrmWithRiaDetailsPricingTask` to use `RiaRate` instead of `RiaRawRate`.
* Updated the `RequiredData` enum to remove `RiaRawRate` and updated the `TrmWithRiaDetailsPricingTask` to require `RiaRate`.
* Updated error codes and messages to reflect the changes.

These changes simplify the pricing task logic and improve maintainability.

Benefits

  • Provides a quick and meaningful summary of changes, saving time during code reviews.
  • Helps in understanding the impact of changes made between commits.
  • Easy to integrate into existing Git workflows.
  • Flexible comparison range with optional end commit hash.

By using changesummary, developers can streamline their code review process and focus on the most important changes.

Here's the source code:

#!/bin/bash

# Check if the start commit hash is provided
if [ -z "$1" ]; then
    echo "Error: Commit hash is required as an argument." >&2
    echo "Usage: $0 <start_commit_hash> [<end_commit_hash>]" >&2
    exit 1
fi

convert_to_uppercase() {
    # Enable case-insensitive matching
    shopt -s nocasematch
    
    # Anchor the pattern at both ends so branch names such as "headline" are left untouched
    if [[ $1 =~ ^(HEAD|FETCH_HEAD|ORIG_HEAD|MERGE_HEAD)(\~[0-9]*|\^[0-9]*)*$ ]]; then
        # Convert to uppercase
        echo "${1^^}"
    else
        # Return original string
        echo "$1"
    fi
    
    # Reset case-sensitivity to default
    shopt -u nocasematch
}

start_hash=$(convert_to_uppercase "$1")
end_hash=$(convert_to_uppercase "${2:-HEAD}")

# Define static prompt text
static_prompt=$(cat <<-END
Analyze the following code diff. Generate a concise summary (under 100 words) of the **key changes** made between the specified commit hashes. Present the changes in a bullet-point list format, focusing on the main modifications and their impact.
Code changes:
END
)

# Define model and system message variables
model="meta-llama/llama-4-maverick:free"
system_message="You are a programmer"

# Execute git diff and pipe its output to the AI model
git diff -w -b "$start_hash..$end_hash" | jq -R -s --arg model "$model" --arg system_content "$system_message" --arg static_prompt "$static_prompt" \
    '{
        model: $model,
        messages: [
            {role: "system", content: $system_content},
            {role: "user", content: ($static_prompt + .) }
        ],
        max_tokens: 16384,
        temperature: 0
    }' | curl -s --request POST \
        --url https://openrouter.ai/api/v1/chat/completions \
        --header "Authorization: Bearer $OR_FOR_CI_API_KEY" \
        --header "Content-Type: application/json" \
        --data-binary @- | jq -r '.choices[0].message.content'
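
The pipeline above assumes that git, jq, and curl are installed and that OR_FOR_CI_API_KEY is set; if any of them is missing, the failure shows up late and is not very descriptive. A minimal sketch of a preflight check that could run before the pipeline. The helpers missing_commands and api_key_present are hypothetical and not part of the original script:

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): print each required
# command that cannot be found on PATH, one per line.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Hypothetical helper: succeed only when the API key variable is non-empty.
api_key_present() {
    [ -n "${OR_FOR_CI_API_KEY:-}" ]
}

# Report problems on stderr; a real script would likely exit 1 here.
missing=$(missing_commands git jq curl)
[ -z "$missing" ] || echo "Missing required commands: $missing" >&2
api_key_present || echo "OR_FOR_CI_API_KEY is not set" >&2
```

The sketch only reports problems. Whether to abort or continue is left to the caller, since the original script does not define that behavior.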