Columbia University 哥伦比亚大学
COMS W3134 Data Structures in Java
Final Exam Capstone Project: Most Anagrams Finder
1. Objective
Your goal is to use several data structures and a sorting algorithm you developed this semester to develop a program that finds the word or words with the most anagrams. Insertion sort works well on small amounts of data, which is perfect for this task. BSTreeMap, RBTreeMap, and MyHashMap implement the MyMap interface. So, with proper object-oriented programming techniques, specifically the use of polymorphism, you can create any one of these data structures and refer to it from a MyMap reference. Though all three classes implement the same interface, their methods are implemented very differently, leading to different execution times on the computer. You will time your program’s execution to see how each of the data structures performs.
2. Problem
You will develop a command line Java application to find the word or words with the most anagrams. An anagram is another word that uses all the characters of the given word, just in a different order or with a different capitalization. For instance, “tea” and “eat” are anagrams; “tae” and “aet” are merely permutations of the characters of the word and are not considered anagrams because they are not words provided in the dictionary file. “Tea” and “Eat” are anagrams even though they have different capital letters. The name of the public class must be MostAnagramsFinder, and it must reside in a file named MostAnagramsFinder.java.
The program will take in two command line arguments. If the number of arguments supplied is incorrect, the program will print the usage message "Usage: java MostAnagramsFinder <dictionary file> <bst|rbt|hash>" and a newline character to stderr and exit in failure.
If the number of command line arguments is correct, the program will then verify that the dictionary file exists. If it does not, the program will print the message "Error: Cannot open file '" + args[0] + "' for input." and a newline character to stderr, where args[0] contains the name of the dictionary file the user is attempting to open, and exit in failure.
The program will then try to parse the command line for which data structure the user wishes to use. Valid strings are “bst”, “rbt”, and “hash”. Any other string including those with different capitalization is considered invalid, causing the program to print the message
"Error: Invalid data structure '" + args[1] + "' received." and a newline character to stderr, where args[1] contains the data structure string name, and exit in failure.
After validating the command line arguments, you’ll need to instantiate either a BSTreeMap, RBTreeMap, or MyHashMap. You must create a reference to the interface rather than the class itself, as in: MyMap<String, Integer> map = new BSTreeMap<>(); No credit will be given if your map variable is not of type MyMap. The key-to-value mapping is String-to-MyList<String>. The key must be the sorted, lowercase spelling of the word. You must use/modify the insertion sort method given in class to sort the character of the string. Strings are immutable, so use an array of chars. As an example, when processing the word “Tea”, the key should be “aet”, since all characters have been made lowercase and the characters within the word have been sorted lexicographically. The value is MyList<String>, a MyLinkedList of all the words that are anagrams of the key. The elements in the MyLinkedList may be stored in any order and must be the word with its original capitalization.
The dictionary file contains a list of all the words to process. The file is stored in Unix/Mac format, where each line ends with a single '\n' character. No word will appear twice in the dictionary with the same exact capitalization. The order of the words in the file is random; it may or may not be sorted. For each word in the file, convert it to lowercase characters, sort the characters, and then insert the key-value pair into the map.
If an I/O error occurs as the reading takes place, the program must print the message "Error: An I/O error occurred reading '" + args[0] + "'."
and a newline character to stderr, where args[0] contains the name of the file the user is attempting to process, and exit in failure.
If no anagrams are found in the input file, the output is the string literal "No anagrams found." and a newline character. If anagrams are present, the first line of output displays the number of groups of anagrams and the anagram count. Sometimes, there is one group of words of maximal length, but sometimes, there are ties, and several groups of anagrams are of maximal length. Always print
"Groups: " + groupCount + ", Anagram count: " + anagramCount followed by a newline character. Each group is enclosed in square brackets []. The words in the group are sorted lexicographically. That means that words beginning with a capital letter will appear in the output before words beginning with a lowercase letter. The groups themselves are sorted lexicographically by the first word in the list.
3. Requirements
You must use the supplied implementations of the BSTreeMap, RBTreeMap, and MyHashMap. You must not modify these files in any way. You must write all your code in MostAnagramsFinder.java. We will copy the source for the data structures into the same folder as your source code when grading your submission. If you wish to write additional classes to help you encapsulate the data you process, put the non-public class inside MostAnagramsFinder.java.
You must not import any other packages from the Java API. They will be stripped away before grading, and your code will not compile.
You are expected to comment your code. You must use Javadoc for every method. Additionally, inline comments are recommended.
Any method implementing the insertion sort algorithm must be named "insertionSort". The argument list and return type is left to your discretion. You must use/modify the insertion sort method given in class. It is acceptable and even necessary to have several overloaded versions of the method with different input arguments.
You may use the Internet to help with various parts of this assignment, including reading from a file, using Iterators, etc. Be sure to cite the source in an inline comment above the lines of code you reference. You may not consult with another human. You may not use openai.com or any other AI tools. Since the project is open-ended, there are many different ways you can approach solving the problem. We will use Moss to check your code against that of every other student in the class. If your code is too similar to someone else’s in the class, you will fail the course. Please do everything possible to avoid this scenario!
你的目标是使用本学期开发的几种数据结构和一个排序算法,开发一个程序来找到具有最多字母重排(anagrams)的单词。插入排序非常适合处理少量数据,非常适合此任务。BSTreeMap、RBTreeMap 和 MyHashMap 实现了 MyMap 接口。因此,通过适当的面向对象编程技术,特别是多态性的使用,你可以创建这些数据结构中的任何一个,并通过 MyMap 引用来引用它们。虽然这三个类都实现了相同的接口,但它们的方法实现非常不同,导致在计算机上有不同的执行时间。你将对程序的执行时间进行计时,以查看每个数据结构的表现。
你将开发一个命令行 Java 应用程序,找到具有最多字母重排(anagrams)的单词。字母重排是指使用给定单词的所有字符但顺序不同或大小写不同的另一个单词。例如,“tea”和“eat”是字母重排;“tae”和“aet”只是字符的排列,不被认为是字母重排,因为它们不是字典文件中提供的单词。即使“Tea”和“Eat”具有不同的大小写,它们也是字母重排。公共类的名称必须为 MostAnagramsFinder,并且必须位于名为 MostAnagramsFinder.java 的文件中。
程序将接受两个命令行参数。如果提供的参数数量不正确,程序将打印使用消息 "Usage: java MostAnagramsFinder <dictionary file> <bst|rbt|hash>" 并向标准错误输出(stderr)打印一个换行符,然后以失败退出。
如果命令行参数数量正确,程序将验证字典文件是否存在。如果不存在,程序将打印消息 "Error: Cannot open file '" + args[0] + "' for input." 并向标准错误输出打印一个换行符,其中 args[0] 包含用户试图打开的字典文件的名称,然后以失败退出。
然后程序将尝试解析用户希望使用的数据结构的命令行参数。有效的字符串包括 "bst"、"rbt" 和 "hash"。任何其他字符串,包括那些具有不同大小写的字符串,都被视为无效,会导致程序打印消息 "Error: Invalid data structure '" + args[1] + "' received." 并向标准错误输出打印一个换行符,其中 args[1] 包含数据结构字符串名称,然后以失败退出。
在验证命令行参数后,你需要实例化 BSTreeMap、RBTreeMap 或 MyHashMap。你必须创建一个指向接口的引用,而不是类本身,例如:MyMap<String, Integer> map = new BSTreeMap<>(); 如果 map 变量的类型不是 MyMap,将不予计分。键值映射是从 String 到 MyList<String> 的映射。键必须是单词的排序后的、小写形式。你必须使用/修改课堂上给出的插入排序方法来对字符串的字符进行排序。字符串是不可变的,所以请使用字符数组。作为示例,当处理单词 "Tea" 时,键应为 "aet",因为所有字符都已转换为小写,并且单词中的字符已按字典顺序排序。值是 MyList<String>,即所有该键字母重排单词的 MyLinkedList。MyLinkedList 中的元素可以按任何顺序存储,并且必须是具有原始大小写的单词。
字典文件包含所有要处理的单词列表。文件以 Unix/Mac 格式存储,其中每行以单个 \n 字符结尾。字典中不会有两次出现完全相同大小写的单词。文件中的单词顺序是随机的;它可能已经排序也可能没有排序。对于文件中的每个单词,将其转换为小写字符,排序这些字符,然后将键值对插入到映射中。
如果在读取过程中发生 I/O 错误,程序必须打印消息 "Error: An I/O error occurred reading '" + args[0] + "'." 并向标准错误输出打印一个换行符,其中 args[0] 包含用户试图处理的文件的名称,然后以失败退出。
如果在输入文件中没有找到字母重排,则输出字符串文字 "No anagrams found." 并换行。如果存在字母重排,第一行输出显示字母重排组的数量和字母重排数量。有时,最大长度的单词组只有一个,但有时也会出现平局,多个字母重排组的长度相同。始终打印 "Groups: " + groupCount + ", Anagram count: " + anagramCount 并换行。每个组用方括号 [] 括起来。组内的单词按字典顺序排序。这意味着以大写字母开头的单词在输出中会出现在以小写字母开头的单词之前。组本身按列表中第一个单词的字典顺序排序。
你必须使用提供的 BSTreeMap、RBTreeMap 和 MyHashMap 实现。不得以任何方式修改这些文件。你必须将所有代码编写在 MostAnagramsFinder.java 中。我们将在评分时将数据结构的源代码复制到与你的源代码相同的文件夹中。如果你希望编写其他类来帮助封装你处理的数据,请将非公共类放在 MostAnagramsFinder.java 中。你不得导入 Java API 的任何其他包。它们将在评分前被剥离掉,你的代码将无法编译。你需要对代码进行注释。每个方法都必须使用 Javadoc。另外,建议使用内联注释。
实现插入排序算法的方法必须命名为 "insertionSort"。参数列表和返回类型由你自行决定。你必须使用/修改课堂上给出的插入排序方法。允许并且在某些情况下甚至需要有几个重载的方法,具有不同的输入参数。
咨询 Alpha 小助手,获取更多课业帮助