后缀数组<br> - VisuAlgo

Suffix Array is a sorted array of all suffixes of a given (usually long) text string T of length n characters (n can be in order of hundred thousands characters).

Suffix Array is a simple, yet powerful data structure which is used, among others, in full text indices, data compression algorithms, and within the field of bioinformatics.

This data structure is very related to the Suffix Tree data structure. Both data structures are usually studied together.

Remarks: By default, we show e-Lecture Mode for first time (or non logged-in) visitor.
If you are an NUS student and a repeat visitor, please login.

→

🕑

The visualization of Suffix Array is simply a table where each row represents a suffix and each column represents the attributes of the suffixes.

The four (basic) attributes of each row i are:

index i, ranging from 0 to n-1,
SA[i] is the i-th lexicographically smallest suffix of T is the SA[i]-th suffix
The SA values (permutation of [0..n-1]) is the one that we need to compute fast,
LCP[i] is the Longest Common Prefix between the i-th and the (i-1)-th lexicographically smallest suffixes of T is LCP[i]
The LCP values also need to be computed fast and we will see the application of this attribute soon, and
Suffix T[SA[i]:] is the i-th lexicographically smallest suffix of T is from index SA[i] to the end (index n-1)
We do not actually compute these suffixes (it is very slow to do so) and these sorted suffixes are only in this visualization to aid quick understanding of the Suffix Array (and LCP) algorithms.

Some operations may add more attributes to each row and are explained when that operations are discussed.

Pro-tip 1: Since you are not logged-in, you may be a first time visitor (or not an NUS student) who are not aware of the following keyboard shortcuts to navigate this e-Lecture mode: [PageDown]/[PageUp] to go to the next/previous slide, respectively, (and if the drop-down box is highlighted, you can also use [→ or ↓/← or ↑] to do the same),and [Esc] to toggle between this e-Lecture mode and exploration mode.

←

→

🕑

后缀数组的所有可用操作如下所列。

构建后缀数组 (SA) 是基于 Karp，Miller，和 Rosenberg (1972) 的想法的 O(n log n) 后缀数组构建算法，该算法按照增长长度 (1, 2, 4, 8, ...) 对后缀的前缀进行排序。
搜索利用后缀数组中的后缀已排序的事实，并在 O(m log n) 中调用两个二进制搜索，以找到模式字符串 P 的长度 m 的第一次和最后一次出现。
最长公共前缀 (LCP) 可以在 O(n) 中使用排列 LCP (PLCP) 定理计算两个相邻后缀（不包括第一个后缀）之间的最长公共前缀。这个算法的名字叫做 Kasai's 算法。
最长重复子串 (LRS) 是一个简单的 O(n) 算法，它找到具有最高 LCP 值的后缀。
最长公共子串 (LCS) 是一个简单的 O(n) 算法，它找到来自两个不同字符串的具有最高 LCP 值的后缀。

Pro-tip 2: We designed this visualization and this e-Lecture mode to look good on 1366x768 resolution or larger (typical modern laptop resolution in 2021). We recommend using Google Chrome to access VisuAlgo. Go to full screen mode (F11) to enjoy this setup. However, you can use zoom-in (Ctrl +) or zoom-out (Ctrl -) to calibrate this.

←

→

🕑

在这个可视化中，我们展示了基于Karp，Miller和Rosenberg（1972）的想法，通过按递增长度（1，2，4，8，...）排序后缀的前缀，即所谓的前缀倍增算法，来正确地构建后缀数组的O(n log n)方法。

我们限制输入只接受12个（由于可用的绘图空间，不能太长 - 但在后缀树的实际应用中，n可以是十万到百万个字符的顺序）大写（我们会删除您的小写输入）字母和特殊终止符'$'字符（即，[A-Z$]）。如果您没有在输入字符串的末尾写一个终止符'$'，我们将自动这样做。如果您在输入字符串的中间放一个'$'，它们将被忽略。如果您输入一个空的输入字符串，我们将默认为"GATAGACA$"。

为了方便，我们提供了一些通常在后缀树/数组讲座中找到的经典测试用例输入字符串，但为了展示这个可视化工具的强大，我们鼓励您输入任何您选择的12个字符的字符串（以字符'$'结束）。

请注意，LCP数组列在此操作中保持为空。它们将通过最长公共前缀操作单独计算。

Pro-tip 3: Other than using the typical media UI at the bottom of the page, you can also control the animation playback using keyboard shortcuts (in Exploration Mode): Spacebar to play/pause/replay the animation, ←/→ to step the animation backwards/forwards, respectively, and -/+ to decrease/increase the animation speed, respectively.

←

→

🕑

This Prefix Doubling Algorithm runs in O(log n) iterations, where for each iteration, it compares substring T[SA[i]:SA[i+k]] with T[SA[i+k]:SA[i+2*k]], i.e., in layman's terms: first compare two pairs of characters, then compare first two characters with the next two, then compare the first four characters with the next four, and so on.

This algorithm is best explored via visualization, see ConstructSA("GATAGACA$") in action (it is advisable that you exit this e-Lecture mode, run the algorithm in exploration mode, pause the algorithm and replay it frame-by-frame as there are too many elements changing especially during the first sorting iteration).

Time complexity: There are O(log n) prefix doubling iterations, and each iteration we call O(n) Radix Sort, thus it runs in O(n log n) — good enough to handle up to n ≤ 200K characters in typical programming competition problems involving long strings.

←

→

🕑

After we construct the Suffix Array of T in O(n log n), we can search for the occurrence of Pattern string T in O(m log n) by binary searching the sorted suffixes to find the lower bound (the first occurrence of P as a prefix of any suffix of T) and the upper bound positions (thelast occurrence of P as a prefix of any suffix of T).

Time complexity: O(m log n) and it will return an interval of size k where k is the total number of occurrences.

For example, on the Suffix Array of T = "GATAGACA$" above, try these scenarios:

P returns a range of rows: Search("GA"), occurrences = {4, 0}
P returns one row only: Search("CA"), occurrences = {2}
P is not found in T: Search("WONKA"), occurrences = {NIL}

PS: There is a slightly faster O(m+log n) variant that has not been visualized yet.

←

→

🕑

We can compute the Longest Common Prefix (LCP) of two adjacent suffixes (in Suffix Array order) in O(n) time using three phases of Kasai's algorithm. This algorithm takes advantage that if we have a long LCP between two adjacent suffixes (in Suffix Array order), that long LCP has lots of overlap with another suffix in positional order when its first character is removed.

The first phase: Compute the value of Phi[], where Phi[SA[i]] = SA[i-1] in O(n). This is to help the algorithm knows in O(1) time of which Suffix is behind Suffix-SA[i] in Suffix Array order. Try LCP("GATAGACA$") and focus on the first part on filling column Phi[] (it is advisable that you exit this e-Lecture mode, run the algorithm in exploration mode, pause the algorithm and replay it frame-by-frame as there are too many elements changing).

←

→

🕑

The second phase: Compute the PLCP[] values between a Suffix-i in positional order with Suffix-Phi[i] (the one behind Suffix-i in Suffix Array order). When we advance to the next index i+1 in positional order, we will remove the front most character of the suffix, but possibly retain lots of LCP value between Suffix-(i+1) and Suffix-Phi[(i+1)].

PLCP Theorem (not proven) shows that the LCP values can only be incremented up to n times, and thus can only be decremented at most n times too, making the overall complexity of the second phase to be also O(n).

Now, retry LCP("GATAGACA$") again and focus on the middle part on filling column PLCP[] (again, it is advisable that you exit this e-Lecture mode, run the algorithm in exploration mode, pause the algorithm and replay it frame-by-frame as there are too many elements changing).

←

→

🕑

The third phase: We compute the value of LCP[], where LCP[i] = PLCP[SA[i]] in O(n). This LCP values are the one that we use for other Suffix Array applications later.

Finally, retry LCP("GATAGACA$") again and focus on the last part on filling column LCP[] (as usual, exit this e-Lecture mode, run the algorithm in exploration mode, pause the algorithm and replay it frame-by-frame as there are too many elements changing).

Time complexity: Kasai's algorithm utilizes the PLCP theorem where the total number of increase (and decrease) operations of the value of the LCP is at most O(n). Thus Kasai's algorithm runs in O(n) overall. Thus, the combination of O(n log n) Suffix Array construction (via the Prefix Doubling algorithm) and the O(n) computation of LCP Array using this Kasai's algorithm is good enough to handle up to n ≤ 200K characters in typical programming competition problems involving long strings.

←

→

🕑

After we construct the Suffix Array of T in O(n log n) and compute its LCP Array in O(n), we can find the Longest Repeated Substring (LRS) in T by simply iterating through all LCP values and reporting the largest one.

This is because each value LCP[i] the LCP Array means the longest common prefix between two lexicographically adjacent suffixes: Suffix-i and Suffix-(i-1). This corresponds to an internal vertex of the equivalent Suffix Tree of T that branches out to at least two (or more) suffixes, thus this common prefix of these adjacent suffixes are repeated.

The longest common (repeated) prefix is the required answer, which can be found in O(n) by going through the LCP array once.

Without further ado, try LRS("GATAGACA$"). We have LRS = "GA".

It is possible that T contains more than one LRS, e.g., try LRS("BANANABAN$").
We have LRS = "ANA" (actually overlap) or "BAN" (without overlap).

←

→

🕑

After we construct the generalized Suffix Array of the concatenation of both strings T1$T2# of length n = n1+n2 in O(n log n) and compute its LCP Array in O(n), we can find the Longest Repeated Substring (LRS) in T by simply iterating through all LCP values and reporting the largest one that comes from two different strings.

Without further ado, try LCS("GATAGACA$", "CATA#") on the generalized Suffix Array of string T1 = "GATAGACA$" and T2 = "CATA#". We have LCS = "ATA".

←

→

🕑

You are allowed to use/modify our implementation code for fast Suffix Array+LCP: sa_lcp.cpp | py | java | ml to solve programming contest problems that need it.

You have reached the last slide. Return to 'Exploration Mode' to start exploring!

Note that if you notice any bug in this visualization or if you want to request for a new visualization feature, do not hesitate to drop an email to the project leader: Dr Steven Halim via his email address: stevenhalim at gmail dot com.

←

🕑

构造后缀数组

搜索

最长公共前缀

最长重复子串

最长公共子串

关于团队使用条款
 隐私政策

关于

VisuAlgo最初由副教授Steven Halim于2011年构思，旨在通过提供自学、互动式学习平台，帮助学生更深入地理解数据结构和算法。

VisuAlgo涵盖了Steven Halim博士与Felix Halim博士、Suhendry Effendy博士合著的书《竞技编程》中讨论的许多高级算法。即使过去十年，VisuAlgo仍然是可视化和动画化这些复杂算法的独家平台。

虽然VisuAlgo主要面向新加坡国立大学（NUS）的学生，包括各种数据结构和算法课程（例如CS1010/等价课程，CS2040/等价课程（包括IT5003），CS3230，CS3233和CS4234），但它也是全球好奇心的宝贵资源，促进在线学习。

最初，VisuAlgo并不适用于智能手机等小触摸屏，因为复杂的算法可视化需要大量的像素空间和点击拖动交互。为了获得最佳用户体验，建议使用最低分辨率为1366x768的屏幕。然而，自2022年4月以来，VisuAlgo的移动（精简）版本已经推出，使得在智能手机屏幕上使用VisuAlgo的部分功能成为可能。

VisuAlgo仍然在不断发展中，正在开发更复杂的可视化。目前，该平台拥有24个可视化模块。

VisuAlgo配备了内置的问题生成器和答案验证器，其“在线测验系统”使学生能够测试他们对基本数据结构和算法的理解。问题根据特定规则随机生成，并且学生提交答案后会自动得到评分。随着越来越多的计算机科学教师在全球范围内采用这种在线测验系统，它可以有效地消除许多大学标准计算机科学考试中手工基本数据结构和算法问题。通过给通过在线测验的学生分配一个小但非零的权重，计算机科学教师可以显著提高学生对这些基本概念的掌握程度，因为他们可以在参加在线测验之前立即验证几乎无限数量的练习题。每个VisuAlgo可视化模块现在都包含自己的在线测验组件。

VisuAlgo已经被翻译成三种主要语言：英语、中文和印尼语。此外，我们还用各种语言撰写了关于VisuAlgo的公开笔记，包括印尼语、韩语、越南语和泰语：

id, kr, vn, th.

团队

项目领导和顾问（2011年7月至今）
Associate Professor Steven Halim, School of Computing (SoC), National University of Singapore (NUS)
Dr Felix Halim, Senior Software Engineer, Google (Mountain View)

本科生研究人员 1
CDTL TEG 1: Jul 2011-Apr 2012: Koh Zi Chun, Victor Loh Bo Huai

最后一年项目/ UROP学生 1
Jul 2012-Dec 2013: Phan Thi Quynh Trang, Peter Phandi, Albert Millardo Tjindradinata, Nguyen Hoang Duy
Jun 2013-Apr 2014 Rose Marie Tan Zhao Yun, Ivan Reinaldo

本科生研究人员 2
CDTL TEG 2: May 2014-Jul 2014: Jonathan Irvin Gunawan, Nathan Azaria, Ian Leow Tze Wei, Nguyen Viet Dung, Nguyen Khac Tung, Steven Kester Yuwono, Cao Shengze, Mohan Jishnu

最后一年项目/ UROP学生 2
Jun 2014-Apr 2015: Erin Teo Yi Ling, Wang Zi
Jun 2016-Dec 2017: Truong Ngoc Khanh, John Kevin Tjahjadi, Gabriella Michelle, Muhammad Rais Fathin Mudzakir
Aug 2021-Apr 2023: Liu Guangyuan, Manas Vegi, Sha Long, Vuong Hoang Long, Ting Xiao, Lim Dewen Aloysius

本科生研究人员 3
Optiver: Aug 2023-Oct 2023: Bui Hong Duc, Oleh Naver, Tay Ngan Lin

最后一年项目/ UROP学生 3
Aug 2023-Apr 2024: Xiong Jingya, Radian Krisno, Ng Wee Han, Tan Chee Heng
Aug 2024-Apr 2025: Edbert Geraldy Cangdinata, Huang Xing Chen, Nicholas Patrick

List of translators who have contributed ≥ 100 translations can be found at statistics page.

致谢
NUS教学与学习发展中心（CDTL）授予拨款以启动这个项目。在2023/24学年，Optiver的慷慨捐赠将被用来进一步开发 VisuAlgo。

使用条款

VisuAlgo慷慨地向全球计算机科学界提供免费服务。如果您喜欢VisuAlgo，我们恳请您向其他计算机科学学生和教师宣传它的存在。您可以通过社交媒体平台（如Facebook、YouTube、Instagram、TikTok、Twitter等）、课程网页、博客评论、电子邮件等方式分享VisuAlgo。

数据结构与算法（DSA）的学生和教师可以直接在课堂上使用本网站。如果您从本网站截取屏幕截图或视频，可以在其他地方使用，但请引用本网站的URL（https://visualgo.net）和/或下面的出版物列表作为参考。但请不要下载VisuAlgo的客户端文件并将其托管在您的网站上，因为这构成了抄袭行为。目前，我们不允许他人分叉此项目或创建VisuAlgo的变体。个人使用离线副本的客户端VisuAlgo是可以接受的。

请注意，VisuAlgo的在线测验组件具有重要的服务器端元素，保存服务器端脚本和数据库并不容易。目前，普通公众只能通过“培训模式”访问在线测验系统。“测试模式”提供了一个更受控制的环境，用于在新加坡国立大学的真实考试中使用随机生成的问题和自动验证。

出版物列表

这项工作曾在2012年国际大学生程序设计竞赛（波兰，华沙）的CLI研讨会上和2012年国际信息学奥林匹克竞赛（意大利，锡尔米奥内-蒙蒂基亚里）的IOI会议上展示过。您可以点击此链接阅读我们2012年关于该系统的论文（当时还没有称为VisuAlgo），以及此链接阅读2015年的简短更新（将VisuAlgo与之前的项目关联起来）。

错误报告或新功能请求

VisuAlgo并不是一个完成的项目。Steven Halim副教授仍在积极改进VisuAlgo。如果您在使用VisuAlgo时发现任何可视化页面/在线测验工具中的错误，或者您想要请求新功能，请联系Steven Halim副教授。他的联系方式是将他的名字连接起来，然后加上gmail dot com。

隐私政策

版本 1.2 (更新于2023年8月18日星期五)。

自2023年8月18日（星期五）起，我们不再使用 Google Analytics。因此，我们现在使用的所有 cookies 仅用于此网站的运营。即使是首次访问的用户，烦人的 cookie 同意弹窗现在也已关闭。

自2023年6月7日（星期五）起，由于 Optiver 的慷慨捐赠，全世界的任何人都可以自行创建一个 VisuAlgo 账户，以存储一些自定义设置（例如，布局模式，默认语言，播放速度等）。

此外，对于 NUS 学生，通过使用 VisuAlgo 账户（一个 NUS 官方电子邮件地址，课堂名册中的学生姓名，以及在服务器端加密的密码 - 不存储其他个人数据），您同意您的课程讲师跟踪您的电子讲义阅读和在线测验培训进度，这是顺利进行课程所必需的。您的 VisuAlgo 账户也将用于参加 NUS 官方的 VisuAlgo 在线测验，因此，将您的账户凭据传递给他人代您进行在线测验构成学术违规。课程结束后，您的用户账户将被清除，除非您选择保留您的账户（OPT-IN）。访问完整的 VisuAlgo 数据库（包含加密密码）的权限仅限于 Halim 教授本人。

对于全球其他已经给 Steven 写过信的 CS 讲师，需要一个 VisuAlgo 账户（您的（非 NUS）电子邮件地址，您可以使用任何显示名称，以及加密密码）来区分您的在线凭据与世界其他地方。您的账户将具有 CS 讲师特定的功能，即能够查看隐藏的幻灯片，这些幻灯片包含了在隐藏幻灯片之前的幻灯片中提出的问题的（有趣的）答案。您还可以访问 VisuAlgo 在线测验的 Hard 设置。您可以自由地使用这些材料来增强您的数据结构和算法课程。请注意，未来可能会有其他 CS 讲师特定的功能。

对于任何拥有 VisuAlgo 账户的人，如果您希望不再与 VisuAlgo 工具有关联，您可以自行删除您的账户。