Answers file
============

Replace the placeholders "..." with your answers.

0. Who are the group members?

Gustav Eurén, Hussein Mansour, John Croft

00. Did you run your tests using PyPy?

No

(You can answer "no" if you don't know about PyPy)

Part 1: Complexity analysis
---------------------------

1a. What is the asymptotic complexity of 'compute_intersections'?
Justify your answer. Answer in terms of N, the total number of 5-grams
in the input files.

You may make the following assumptions:

- There are D documents.
- Each document has the same number of 5-grams K.
- K is larger than D.
- There is not much plagiarised text, that is, most 5-grams occur in
  only one file. Specifically, the number of duplicate occurrences of
  5-grams is a small *constant*.

Complexity: O(N**2)

Justification: The outer pair of for-loops iterates over the documents.
Each loop makes D iterations, but since the order within a pair does not
matter and a document is never compared with itself, only D(D-1)/2
document pairs are considered. The inner pair of for-loops iterates over
the 5-grams of the two documents; here all pairs are relevant, so K**2
iterations are performed per document pair. The final statement, which
checks whether two 5-grams match, takes constant time and does not affect
the asymptotic complexity. Altogether this gives
O(D(D-1)/2 * K**2) = O(D**2 * K**2) = O((D*K)**2) = O(N**2),
since N = D*K. (A sketch of this loop structure is shown below.)
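
For reference, a minimal sketch of the loop structure described above,
assuming file_ngrams maps each file path to its list of 5-grams (as in
3b). The function name and the dict-based representation are
hypothetical stand-ins, not taken from the skeleton code:

    def compute_intersections_naive(file_ngrams):
        """Quadratic sketch: compare every pair of documents, and for
        each document pair compare every pair of 5-grams.

        file_ngrams: dict mapping file path -> list of 5-grams.
        Returns a dict mapping (path1, path2) -> list of shared 5-grams.
        """
        paths = list(file_ngrams)
        intersections = {}
        # D(D-1)/2 document pairs: order does not matter, no self-pairs.
        for i in range(len(paths)):
            for j in range(i + 1, len(paths)):
                shared = []
                # K * K comparisons of 5-grams for this document pair.
                for gram1 in file_ngrams[paths[i]]:
                    for gram2 in file_ngrams[paths[j]]:
                        if gram1 == gram2:  # constant-time check
                            shared.append(gram1)
                intersections[(paths[i], paths[j])] = shared
        return intersections

With D documents of K 5-grams each, the outer pair of loops runs
D(D-1)/2 times and the inner pair K**2 times per document pair, which
gives the O((D*K)**2) = O(N**2) bound claimed above.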

1b. How long did the program take to compute the intersections for the
'tiny', 'small' and 'medium' directories? (You can skip the 'medium'
directory if it takes more than 5 minutes.)

tiny: 6.48 s
small: 22.4 s
medium: ... s

1c. Is the ratio between the times what you would expect, given the
asymptotic complexity? Explain very briefly why.

The ratio between the times is approximately 3.46 (22.4/6.48), which is
a little smaller than expected, namely 20000**2/10000**2 = 4. This could
be because the big-O analysis rounds the D(D-1)/2 document pairs of the
outer for-loops up to D**2, so the N**2 bound is slightly pessimistic.

1d. How long do you predict the program would take to compute the
intersections for the 'big' and 'huge' directories, respectively? Show
your calculations. Do the same for the 'medium' directory too if you
didn't run it in 1b. Write your answer in something human-readable
(e.g., minutes/hours/days/months/years).

medium: 9.33 min
big: 15.56 h
huge: 10.37 days

Calculations:
medium: 22.4 s * (100 000 / 20 000)**2 = 560 s = 9.33 min
big: 22.4 s * (1 000 000 / 20 000)**2 = 56 000 s = 15.56 h
huge: 22.4 s * (4 000 000 / 20 000)**2 = 896 000 s = 10.37 days

Part 2: Using an intermediate data structure
--------------------------------------------

2a. How long does it take to (1) build the n-gram index, (2) compute the
intersections, for the 'small' directory?

build_ngram_index: 36.85 s
compute_intersections: 10.28 s

2b. Was this an improvement compared to the original implementation
(see 1b)?

No

Part 3: Implementing a BST
--------------------------

3a. How long does it take to (1) build the n-gram index, (2) compute the
intersections, for the 'tiny', 'small' and 'medium' directories?

If you get a stack overflow or recursion error, just say so.

Note: You might see a *slowdown* compared to above... Don't be alarmed,
you will solve all this in part 4.

tiny:
- build_ngram_index: 1.83 s
- compute_intersections: 0.33 s

small:
- build_ngram_index: 11.98 s
- compute_intersections: 2.15 s

medium:
- build_ngram_index: 36.78 s
- compute_intersections: 6.58 s

3b. Which of the BSTs appearing in the program usually become unbalanced?
(The BSTs are 'file_ngrams', 'ngram_index', 'intersections'.)

file_ngrams maps each file to its n-grams. Since it is built by iterating
over the file paths, which arrive in ascending (sorted) order, every new
key is inserted to the right of the previous one: the height of the BST
becomes equal to the number of files and the tree degenerates into a
chain leaning to the right.

3c (optional). Is there a simple way to stop these trees becoming
unbalanced? (I.e., without using a self-balancing data structure.)

Yes: insert the file paths into the BST in a random order instead of
sorted order. The next key to insert is then chosen uniformly at random
among the keys not yet inserted, so the root (and, recursively, each
subtree root) is expected to lie near the middle of its key range, while
the smallest and largest keys tend to be inserted late. A randomly built
BST is therefore balanced on average.

Part 4: Implementing an AVL tree
--------------------------------

4a. How long does it take to (1) build the n-gram index, (2) compute the
intersections, for the 'small', 'medium' and 'big' directories?

small:
- build_ngram_index: 0.3 s
- compute_intersections: 0.1 s

medium:
- build_ngram_index: 1.9 s
- compute_intersections: 0.4 s

big:
- build_ngram_index: 26.4 s
- compute_intersections: 5.7 s

For the below questions, we denote by N the total number of 5-grams.
We assume there is a (small) constant number of duplicate occurrences of
5-grams.

4b. What is the asymptotic complexity of 'build_ngram_index'?
Justify briefly.

Complexity: O(N log N)

Justification: The algorithm handles all N 5-grams, and each 5-gram is
inserted into the AVL tree, which takes at most O(log N) time because
the tree is balanced. The factor N counts the 5-grams and the factor
log N is the cost of one insertion.

4c. What is the asymptotic complexity of 'compute_intersections'?
Justify briefly.

Complexity: O(N log N)

Justification: Here as well each 5-gram is handled once, and the AVL
operation performed on it takes O(log N). So the complexity is N
iterations times O(log N) per operation. (A sketch of the index-based
approach is shown below.)
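
As referenced in 4c, here is a minimal sketch of the index-based
approach analysed in 4b and 4c. The '_sketch' names are hypothetical
stand-ins for the lab's build_ngram_index and compute_intersections;
plain Python dicts replace the AVL-tree map so the sketch is
self-contained, and the comments mark where the O(log N) AVL operations
would occur:

    def build_ngram_index_sketch(file_ngrams):
        # Invert file -> 5-grams into 5-gram -> files containing it.
        # With an AVL-backed map each of the N lookups/insertions below
        # costs O(log N), giving O(N log N) in total; a dict is used
        # here only to keep the sketch runnable on its own.
        ngram_index = {}
        for path, ngrams in file_ngrams.items():
            for gram in ngrams:                 # N iterations in total
                if gram not in ngram_index:     # O(log N) lookup in an AVL map
                    ngram_index[gram] = []
                ngram_index[gram].append(path)  # O(log N) update in an AVL map
        return ngram_index

    def compute_intersections_sketch(ngram_index):
        # Collect the shared 5-grams per file pair straight from the
        # index. Since almost every 5-gram occurs in only one file, the
        # pair loop does constant work per 5-gram, so the total is again
        # N iterations times O(log N) per map operation = O(N log N).
        intersections = {}
        for gram, paths in ngram_index.items():
            for i in range(len(paths)):
                for j in range(i + 1, len(paths)):
                    key = (paths[i], paths[j])
                    if key not in intersections:     # O(log N) in an AVL map
                        intersections[key] = []
                    intersections[key].append(gram)  # O(log N) in an AVL map
        return intersections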

4d. The 'huge' directory contains 4 times as many n-grams as the 'big'.
Approximately how long will it take to run the program on 'huge', given
that you know how long it took to run on 'big' (or 'medium')? Justify
briefly.

If you have the patience you can also run it to see how close your
calculations were.

Theoretical time to run 'huge': 141.24 s

Justification: Both functions have complexity O(N log N), and 'huge'
contains 4 times as many 5-grams as 'big'. The expected increase factor
is therefore (4N * log 4N) / (N log N) ≈ 4.4 when N is 1 000 000, which
is the size of 'big'. So when N grows by a factor of 4, the run time
should grow by a factor of roughly 4.4. Applying this to the 'big' run
times:
26.4 s * 4.4 = 116.16 s
5.7 s * 4.4 = 25.08 s
116.16 s + 25.08 s = 141.24 s

(Optional) Actual time to run 'huge': 113.6 s + 38.4 s = 152 s

4e. Briefly compare your answer in 4d, with your answer in 1d.

The estimated time to compute the intersections for the 'huge' directory
using an AVL tree to implement the map is 25.08 s (the actual run time
was 38.4 s), compared to an estimated 10.37 days using a list of
key-value pairs. Note that apart from using an AVL tree instead of a
list, the latter implementation also uses an intermediate n-gram index:
this speeds up computing the intersections, but extra time is spent
building the index instead (the first implementation spends no time at
all on that step), so its total run time is slightly worse than the
compute_intersections time alone suggests. Even so, using the AVL tree
still improves the efficiency of the program a lot.

4f (optional). Instead of the previous assumption, we now allow an
arbitrary total similarity score S. What is the asymptotic complexity of
the two functions in terms of both N and S (at the same time)?

Complexity of 'build_ngram_index': ...

Justification: ...

Complexity of 'compute_intersections': ...

Justification: ...

Appendix: general information
-----------------------------

A1. How many hours did you spend on the assignment?

John: 8-9 hours

A2. Do you know of any bugs or limitations?

No.

A3. Did you collaborate with any other students on this lab? If so,
write with whom and in what way you collaborated. Also list any
resources (including the web) you have used in creating your design.

No collaboration. Course book and lecture materials.

A4. Did you encounter any serious problems?

No.

A5. Do you have other comments or feedback?

...