feat(podcast): add immersive story mode with multilingual podcast generation

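Add an immersive story generation mode that reads the original text aloud with intelligent segmentation:
- Server: add the generate_podcast_with_story_api function and a dedicated API endpoint
- Add story-mode prompt templates (prompt-story-overview.txt and prompt-story-podscript.txt)
- Frontend: add a mode-switch UI with two modes, AI podcast and immersive story
- Immersive story mode costs a fixed 30 credits and does not take language or duration parameters
- Refine the audio silence-trimming logic to keep 200ms of silence at the head and tail for a more natural sound
- Fix session management and error handling to improve system stability
- Add multilingual strings (Chinese, English, Japanese) for the mode-switch copy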
2025-10-19 22:09:13 +08:00
parent 321e3cded4
commit dd2a1b536f
18 changed files with 672 additions and 116 deletions
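
The silence-trimming change described in the commit message (keeping 200ms at the head and tail) is not among the hunks shown here. The following is only a minimal sketch of that idea over a plain list of samples; the function name `trim_silence`, the `threshold` value, and the sample representation are assumptions for illustration, not the project's actual implementation.

```python
from typing import List, Sequence


def trim_silence(
    samples: Sequence[float],
    sample_rate: int,
    threshold: float = 0.01,  # hypothetical amplitude below which a sample counts as silent
    keep_ms: int = 200,       # padding kept on each side, per the commit message
) -> List[float]:
    """Trim leading/trailing silence, keeping up to keep_ms of it on each side."""
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not loud:
        # Nothing above the threshold: return the input unchanged
        # rather than producing an empty clip.
        return list(samples)
    pad = sample_rate * keep_ms // 1000
    start = max(0, loud[0] - pad)
    end = min(len(samples), loud[-1] + 1 + pad)
    return list(samples[start:end])
```

With a 1000 Hz signal made of 500 silent samples, 100 loud samples, and 500 silent samples, the result keeps 200 silent samples on each side of the loud region.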
--- a/server/prompt/prompt-podscript.txt
+++ b/server/prompt/prompt-podscript.txt
@@ -76,10 +76,6 @@ You are a master podcast scriptwriter, adept at transforming diverse input conte
        *   **Debate & Contrasting Views:** Use the host personas to create discussions from different perspectives, compelling other hosts to provide more detailed defenses and explanations.
        *   **Restatement & Summary:** The host (`speaker_0`) should provide restatements and summaries during pauses in the discussion and at the end of topics.

-8. **Copy & Replacement:**
-	If a hyphen connects English letters and numbers or letters on both sides, replace it with a space.
-	Replace four-digit Arabic numerals with their Chinese character equivalents, one-to-one.
-
 </guidelines>

 <examples>
@@ -90,7 +86,7 @@ You are a master podcast scriptwriter, adept at transforming diverse input conte
    <turn_pattern>random</turn_pattern>
  </podcast_settings>
  <source_content>
-    Quantum computing uses quantum bits or qubits which can exist in multiple states simultaneously due to superposition. This is different from classical bits (0 or 1). Think of it like a spinning coin. This allows for massive parallel computation.
+    {{input_content}}
  </source_content>
 </input>
 <output_format>
@@ -139,6 +135,8 @@ You are a master podcast scriptwriter, adept at transforming diverse input conte
 ]
 }}
 </output_format>
+</examples>
+
 <final>
 Transform the source material into a lively and engaging podcast conversation based on the provided settings. Craft dialogue that showcases authentic group chemistry and natural interaction. Use varied speech patterns reflecting real human conversation, ensuring the final script effectively educates and entertains the listener.
 The final output is a JSON string without code blocks.
--- a/server/prompt/prompt-story-overview.txt
+++ b/server/prompt/prompt-story-overview.txt
@@ -0,0 +1,21 @@
+**1. Metadata Generation**
+
+*   **Step 1: Intermediate Core Summary Generation (Internal Step)**
+    *   **Task**: First, generate a core idea summary of approximately 150 characters based *only* on the **[body content]** of the document (ignoring titles and subtitles).
+    *   **Position**: As the **fourth line** of the final output.
+
+*   **Step 2: Title Generation**
+    *   **Source**: Must be refined from the "core summary" generated in the previous step.
+    *   **Length**: Strictly controlled to be between 15-20 characters.
+    *   **Format**: Adopt a "Main Title: Subtitle" structure, using a full-width colon ":" for separation. For example: "Brevity and Precision: Practical Engineering for AI Context".
+    *   **Position**: As the **first line** of the final output.
+
+*   **Step 3: Tag Generation**
+    *   **Source**: Extract from the **[body content]** of the document (ignoring titles and subtitles).
+    *   **Quantity**: 3 to 5.
+    *   **Format**: Keywords separated by the "#" symbol (e.g., #Keyword1#Keyword2).
+    *   **Position**: As the **second line** of the final output.
+
+**2. Output Language**
+
+*   **Make sure the output is written in the same language as the original input.**
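
The overview template above fixes line positions for its output: title on the first line, tags on the second, and the core summary on the fourth. A caller might split such a response as sketched below. The function name `parse_overview` and the returned dictionary keys are hypothetical, and the sketch assumes the third line is blank or otherwise unused, which the template does not state.

```python
from typing import Dict, List, Union


def parse_overview(text: str) -> Dict[str, Union[str, List[str]]]:
    """Split a model response into title, tags, and summary by line position."""
    lines = text.strip().split("\n")
    title = lines[0].strip() if len(lines) > 0 else ""
    tags_line = lines[1].strip() if len(lines) > 1 else ""
    summary = lines[3].strip() if len(lines) > 3 else ""
    # Tags are written as "#Keyword1#Keyword2"; drop the empty piece before the first "#".
    tags = [t.strip() for t in tags_line.split("#") if t.strip()]
    return {"title": title, "tags": tags, "summary": summary}
```

Missing lines come back as empty values rather than raising, so the caller can decide whether to retry the generation.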
--- a/server/prompt/prompt-story-podscript.txt
+++ b/server/prompt/prompt-story-podscript.txt
@@ -0,0 +1,119 @@
+* **Output Format:** No explanatory text! The final output is a JSON string without code blocks. Make sure the language of the output content is the same as the source content.
+* **End Format:** Do not add any summary or concluding remarks. The output must be only the JSON object.
+
+<podcast_generation_system>
+You are an intelligent text-processing system. Your task is to take the input content, segment it into complete sentences, assign speaker IDs according to the rules, and output the result as a raw JSON string, preserving the original text.
+
+<input>
+  <!-- Podcast settings provide high-level configuration for the script generation. -->
+  <podcast_settings>
+    <!-- Define the total number of speakers. Minimum 1. Every speaker must be assigned at least one statement. -->
+    <num_speakers>{{numSpeakers}}</num_speakers> 
+  </podcast_settings>
+  
+  <!-- The source_content contains the text to be processed. -->
+  <source_content>
+    {{input_content}}
+  </source_content>
+</input>
+
+<guidelines>
+
+1.  **Primary Goal & Output Format:**
+    *   Your only task is to convert the `<source_content>` into a JSON string.
+    *   The output must be a single JSON object with one key: `"podcast_transcripts"`.
+    *   The value of `"podcast_transcripts"` must be an array of objects, where each object has two keys: `"speaker_id"` (an integer) and `"dialog"` (a string).
+    *   **Strictly output only the JSON string.** Do not include any explanations, comments, or code block formatting (like ```json).
+
+2.  **Text Segmentation:**
+    *   Analyze the `<source_content>` and break it down into logical, complete sentences or statements.
+    *   Segmentation should occur at natural punctuation marks (e.g., periods, question marks, exclamation points) or logical breaks in the flow of a single speaker's thought.
+    *   **Crucially, you must not alter, summarize, or rewrite the original text.** The content of the `"dialog"` field must be an exact segment from the source.
+    *   The output language must be identical to the input language.
+
+3.  **Speaker ID Assignment Logic (Roles):**
+    *   **If Source Content Contains Speaker Roles:** If the `source_content` explicitly identifies speakers (e.g., "主持人:", "嘉宾A:", "Speaker 1:", "角色A："), you must map these roles to unique, consistent `speaker_id` integers (starting from 0). For example, "主持人" is always `speaker_id: 0`, "嘉宾A" is always `speaker_id: 1`, etc. Remove the role identifier (e.g., "主持人:") from the beginning of the `"dialog"` string.
+    *   **If Source Content Has No Roles:** Proceed to Guideline 4 for automatic assignment.
+
+4.  **Speaker Assignment & Distribution Logic (Automatic):**
+    *   **Rule 1 (Highest Priority): Logical Grouping.** This is the most important rule. Analyze the flow of the `<source_content>`. If multiple consecutive sentences form a single coherent thought, argument, or detailed explanation, they **must be assigned to the same `speaker_id`**. This is to ensure that a single speaker can fully develop a point before another speaker takes over. It is perfectly acceptable and encouraged for one speaker to have several consecutive dialogue blocks.
+    *   **Rule 2: Speaker Variation.** After applying the logical grouping rule, distribute the resulting sentences or logical blocks among the different speakers to create a varied conversation. Switch speakers at logical transition points in the text, where the topic or perspective shifts.
+    *   **Rule 3: Mandatory Speaker Inclusion.** You **must** ensure that every speaker, from `speaker_id: 0` to `speaker_id: num_speakers - 1`, is assigned at least one line of dialogue. Before finalizing the output, verify that all speakers have participated.
+
+5.  **Content Integrity:**
+    *   The entire `<source_content>` must be processed and included in the final JSON output. No part of the original text should be omitted.
+    *   Concatenating all `"dialog"` strings in order should reconstruct the original `<source_content>` (excluding any speaker role prefixes).
+
+</guidelines>
+
+<examples>
+<!-- Example 1: Input with no speaker roles, demonstrating logical grouping -->
+<input>
+  <podcast_settings>
+    <num_speakers>2</num_speakers>
+  </podcast_settings>
+  <source_content>
+    人工智能的发展进入了一个新阶段。其核心驱动力是大型语言模型的突破。这些模型能够理解和生成极其自然的文本，应用前景广阔。然而，我们也必须关注其伦理风险和潜在的滥用问题。
+  </source_content>
+</input>
+<output_format>
+{{
+"podcast_transcripts": [
+  {{
+    "speaker_id": 0,
+    "dialog": "人工智能的发展进入了一个新阶段。"
+  }},
+  {{
+    "speaker_id": 0,
+    "dialog": "其核心驱动力是大型语言模型的突破。"
+  }},
+  {{
+    "speaker_id": 0,
+    "dialog": "这些模型能够理解和生成极其自然的文本，应用前景广阔。"
+  }},
+  {{
+    "speaker_id": 1,
+    "dialog": "然而，我们也必须关注其伦理风险和潜在的滥用问题。"
+  }}
+]
+}}
+</output_format>
+
+<!-- Example 2: Input with explicit speaker roles -->
+<input>
+  <podcast_settings>
+    <num_speakers>2</num_speakers>
+  </podcast_settings>
+  <source_content>
+    主持人: 大家好，欢迎收听。今天我们来聊聊人工智能。
+    嘉宾: 是的，主持人。人工智能最近发展很快，特别是在大模型领域。
+  </source_content>
+</input>
+<output_format>
+{{
+"podcast_transcripts": [
+  {{
+    "speaker_id": 0,
+    "dialog": "大家好，欢迎收听。"
+  }},
+  {{
+    "speaker_id": 0,
+    "dialog": "今天我们来聊聊人工智能。"
+  }},
+  {{
+    "speaker_id": 1,
+    "dialog": "是的，主持人。"
+  }},
+  {{
+    "speaker_id": 1,
+    "dialog": "人工智能最近发展很快，特别是在大模型领域。"
+  }}
+]
+}}
+</output_format>
+</examples>
+
+<final>
+Adhering strictly to all guidelines, process the input `<source_content>` and generate only the final JSON string. The output must be perfectly formatted JSON and nothing else.
+</final>
+</podcast_generation_system>
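
The guidelines in prompt-story-podscript.txt describe properties that can be checked mechanically on the model's output: a single `"podcast_transcripts"` array, an integer `speaker_id` and string `dialog` per entry, every speaker used at least once, and the segments reconstructing the source. Below is a minimal validation sketch under stated assumptions: the name `validate_transcript` is hypothetical, whitespace is ignored when comparing against the source, and the reconstruction check only makes sense for input without speaker role prefixes (the prompt strips those).

```python
import json
from typing import List


def validate_transcript(raw: str, source: str, num_speakers: int) -> List[str]:
    """Return a list of problems; an empty list means these checks passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]

    turns = data.get("podcast_transcripts") if isinstance(data, dict) else None
    if not isinstance(turns, list):
        return ['missing "podcast_transcripts" array']

    problems: List[str] = []
    seen = set()
    dialogs: List[str] = []
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            problems.append(f"turn {i}: not an object")
            continue
        sid, dialog = turn.get("speaker_id"), turn.get("dialog")
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(sid, int) or isinstance(sid, bool) or not 0 <= sid < num_speakers:
            problems.append(f"turn {i}: bad speaker_id {sid!r}")
        else:
            seen.add(sid)
        if isinstance(dialog, str):
            dialogs.append(dialog)
        else:
            problems.append(f"turn {i}: dialog is not a string")

    missing = sorted(set(range(num_speakers)) - seen)
    if missing:
        problems.append(f"speakers with no dialogue: {missing}")

    # Compare with all whitespace removed, since segment boundaries drop spaces and newlines.
    if "".join("".join(dialogs).split()) != "".join(source.split()):
        problems.append("dialog segments do not reconstruct the source text")
    return problems
```

A caller could retry generation or fall back to a single-speaker split whenever the returned list is non-empty.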