The Rise and Fall of DeepSeek: From Domestic AI Star to the ‘Hallucination King’

2025-07-10

On February 6, 2023, DeepSeek, an artificial intelligence company based in China, was officially founded. Its debut large language model quickly rose to prominence thanks to its strong Chinese language understanding, flexible local deployment options, and developer-friendly free commercial licensing strategy. It was hailed as a dark horse among domestic AI contenders. In benchmark tests such as C-Eval, DeepSeek even outperformed OpenAI’s GPT-3.5, astonishing industry insiders and developer communities both in China and abroad. At its peak, following its public release on January 11, 2025, the model reached over 10 million daily active users within just 20 days, surpassing Baidu’s “Ernie Bot” and capturing nearly 50% of the Chinese AI tool market—a truly dominant moment.

However, this meteoric rise proved unsustainable. A fatal issue with hallucinations began to surface, leading to a collapse in user trust. In an effort to maximize speed, DeepSeek’s developers compressed the original 671B-parameter model down to just 7B through distillation. While this sped responses up, it severely compromised model stability and depth of language comprehension. The Chinese training corpus was riddled with low-quality web content and misinformation, and the model lacked a robust fact-checking mechanism akin to GPT-4’s multi-layered verification architecture. As a result, the accuracy of generated responses became highly unreliable.
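As context for that design choice, here is a minimal sketch of standard teacher-student knowledge distillation in PyTorch. It is a generic illustration of the technique (after Hinton et al., 2015), not DeepSeek’s actual pipeline, and the temperature and mixing weight are arbitrary placeholder values.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Standard knowledge-distillation objective: a weighted mix of
    soft-target KL divergence (the student mimics the teacher's softened
    output distribution) and hard-label cross-entropy."""
    # Soften both distributions with temperature T, then match them.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale to offset the 1/T^2 gradient shrinkage
    # Keep the student anchored to the ground-truth labels as well.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

The trade-off the article describes is inherent to this setup: a 7B student can only approximate the teacher’s output distribution, so aggressive compression tends to fail exactly where nuanced language understanding is required.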

In real-world applications, DeepSeek frequently delivered absurd answers. For example, when asked about “Lu Xun’s representative works,” it incorrectly cited How the Steel Was Tempered—a novel by Nikolai Ostrovsky. It even fabricated a fictional 2023 Nobel Prize winner named “Karl Lindemann.” In code generation, DeepSeek repeatedly produced Python scripts with syntax errors and infinite loops, triggering a wave of complaints from enterprise clients. Investigations revealed a 42% error rate in customer service scenarios and a hallucination rate of 58% in medical Q&A, which led to several legal disputes. On GitHub, only 27% of its generated code was deemed executable by developers.
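The 27% figure implies some pass/fail harness, though the article does not describe one. A plausible minimal check for the two failure modes cited (syntax errors and infinite loops) might look like the following sketch; `is_executable` and its timeout threshold are assumptions, not a documented methodology.

```python
import ast
import subprocess
import sys

def is_executable(snippet: str, timeout_s: float = 5.0) -> bool:
    """Return True if a generated Python snippet parses and runs to
    completion within a timeout (hypothetical harness, not the actual
    methodology behind the 27% figure)."""
    try:
        ast.parse(snippet)  # cheap syntax gate before spawning a process
    except SyntaxError:
        return False
    try:
        result = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True,
            timeout=timeout_s,  # infinite loops surface as timeouts
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```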

These persistent issues remained unresolved, and by the second half of 2025, DeepSeek suffered a dramatic collapse. Trust in developer communities evaporated, and the model’s rating on Hugging Face plunged from 4.1 stars to near irrelevance. User surveys showed that 67% of respondents had abandoned the tool, citing “unreliable answers” as the primary reason. By late 2025, the overall usage rate had fallen to just 3%.

When compared with other major AI models, DeepSeek’s flaws were glaring. GPT-4 combines multimodal fact verification with reinforcement learning from human feedback (RLHF), keeping Chinese hallucination rates around 12%. Baidu’s Ernie Bot uses real-time knowledge graph validation and achieves an 18% hallucination rate. DeepSeek, however, relied solely on its original corpus, resulting in a 53% hallucination rate. The root problem lay in its lack of continual learning and model updates—its core model hadn’t seen a revision in over six months—and its failure to optimize for professional fields like law and medicine. Additionally, the project’s open-source community management was abysmal, with GitHub issues taking an average of 17 days to receive a response, further eroding developer confidence.
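The knowledge-graph validation attributed to Ernie Bot above can be illustrated with a toy sketch: check a model’s answer for a (subject, relation) pair against a curated triple store before surfacing it. The dictionary and function below are invented for illustration; real systems add entity linking and graphs with billions of facts.

```python
# Toy triple store; a production knowledge graph would be vastly larger.
KNOWLEDGE_GRAPH = {
    ("Lu Xun", "representative_work"): {"A Madman's Diary", "The True Story of Ah Q"},
}

def validate_answer(subject: str, relation: str, answer: str) -> bool:
    """Accept a model's answer only if the graph knows the fact and agrees."""
    facts = KNOWLEDGE_GRAPH.get((subject, relation))
    return facts is not None and answer in facts

# The hallucination quoted earlier in the article would be caught:
assert not validate_answer("Lu Xun", "representative_work", "How the Steel Was Tempered")
assert validate_answer("Lu Xun", "representative_work", "A Madman's Diary")
```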

In an attempt to recover, DeepSeek launched a so-called “fact-enhanced version,” but it merely filtered out sensitive keywords. This led to vague, unhelpful responses such as “This content cannot be displayed due to relevant policies,” which failed to address the underlying accuracy problems. Although the company introduced human expert review, the high cost slowed responses by 60%, and the hallucination issues persisted because the flawed base model remained unchanged. A later pivot toward becoming an AIGC (AI-generated content) tool also failed, as users widely reported that the generated content was incoherent and illogical. By this point, DeepSeek had been effectively marginalized from the market.
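To see why the “fact-enhanced” approach could not work, consider a sketch of post-hoc keyword filtering, with invented blocklist terms and the refusal string quoted above: it can only suppress flagged output after generation, so a fluent but factually wrong answer passes through untouched.

```python
# Hypothetical blocklist; the actual filter terms were never published.
BLOCKED_TERMS = {"sensitive-topic-a", "sensitive-topic-b"}

def filter_response(text: str) -> str:
    """Post-hoc keyword filter: hides flagged matches, fixes nothing upstream."""
    if any(term in text.lower() for term in BLOCKED_TERMS):
        return "This content cannot be displayed due to relevant policies."
    return text  # hallucinated but unflagged answers still reach users
```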

From DeepSeek’s downfall, three clear lessons emerge. First, AI models are not better simply because they are faster—speed is meaningless if it sacrifices accuracy. Second, users in the Chinese market particularly value factual precision and trustworthiness over creative expression. Third, in the fast-moving world of large language models, failure to continuously iterate and optimize is tantamount to surrender.

Though DeepSeek has since shifted its focus to enterprise (B2B) clients, without the determination to completely rebuild from the ground up, it is unlikely to regain user trust. This case has already become a cautionary tale in the AI industry—technological leadership may bring short-term glory, but it is the strength of foundational engineering and long-term commitment that ultimately determine an AI product’s fate.