Name______________________________

Data Structures and Algorithms I - Lab 12

This lab covers the use of the Map interface and HashMap class from the JCF. This collection allows very fast add and find operations in a collection utilizing the concept of a hash code. Every object inherits the hashCode method from class Object. This method returns a number calculated from the internal details of the object. Although not guaranteed, the hashCode for two different objects is likely to be different.You should print a copy of this lab so you can record your answers to questions. Turn in the completed questions and any supporting printouts as required in the lab. Here is a link to the API documentation in case you need it.

P1: Create a project for Lab 12 that contains the following class, HashTest.java. We will be testing hash codes for Objects and Strings. Our goal is to determine how likely it is for two different things to have the same hash code. Run the program and record the hash codes for each item:

a:__________________, a2: ____________________, b:___________________

s:__________________, s2: ____________________, 

t: __________________, u: ____________________

Note that a==a2, and s==s2 (these refer to the same objects).

Note also that s.equals(t), but it is not true that t.equals(u).

You need to check the API for the following answer. Look at the specs for hashCode in class Object. If x and y are variables that refer to distinct objects (of any type), and x.equals(y) is true, should the hash codes for x and y be the same? 

Answer: _______

If x.hashCode() is 300 at the start of a loop, then is it true that x.hashCode() must be 300 at the conclusion of the loop.

Answer (this may require a little explanation): 

If x.hashCode()==y.hashCode(), then can you conclude x.equals(y) is true?

Answer: _______

The class String overrides hashCode. The hashCode for the String "ab" was recorded above. Show the actual calculation (with real numbers) used to obtain this particular number. You may want to consult the String class hashCode method documentation. You may also want to lookup the ASCII codes for the letters 'a' and 'b'.

Answer:

P2: Add the Pair.java file to your project. This is a simple class whose instances contain an int and some other Object. We will use this in the next part of the lab. We will need an equals method that reports two Pairs as equal if and only if they have the same integer component (key). Equality testing will ignore the second component.

Before overriding equals, run the test program. There are five Pair variables. Which two are equal and why are these the only ones that are equal?

Answer:

Of these Pairs, if equals is defined as stated above (same keys), which Pairs should be equal in the output?

Answer:

Implement equals to return true if the first components (key) of the Pairs are exactly the same. Verify that equals performs correctly on the test program. There should be two pairs reporting equality and two reporting the opposite.

Uncomment the method calls in main that test the hash code method. The provided implementation for hashCode returns an integer value each time it is called. Run the test program and read the specs for hashCode in class Object. Test1 illustrates an example for this hashCode implementation that violates a stipulation in the specs. Which stipulation?

Answer:

Change hashCode to return the following: value.hashCode(). This will correct the previous problem, but introduces another. Test2 illustrates an example for this hashCode implementation that violates a stipulation in the specs. Which stipulation?

Answer:

Change hashCode to return the following: key*key. This will correct both of the prior problems, but may result in an inefficiency. Test3 illustrates an example for this hashCode implementation that violates a recommendation in the specs. Which recommendation?

Answer:

Change hashCode to return the following:  key* value.hashCode(). This addresses the previous problem, but will violate another stipulation in the specs. Which violation is illustrated by Test4 for this hashCode implementation?

Answer:

Obtaining a good hash code from an object is not a simple matter. The efficiency of hash table organization depends on good hash values. For the rest of the project, we will not actually need the hashCode method of Pair. However, if we decided to store Pairs in a hash table, we would want to spend some time designing hashCode so it is efficient (fast to calculate) and meets the stipulations (especially the connection to .equals).

P3: Add HashCollisions.java to the project. This program will generate a lot of random Strings. Each String will have a hash code. The goal is to determine if two different Strings have the same hash code. To accomplish this, we will need to be able to determine if a hash code is used more than once. We will also need to know which string had that hash code (in case we randomly generate the same string twice). We therefore need to create a collection of (hash code, String) pairs that we can search by hash code. We will start by creating an ArrayList of these Pairs. We want Pair p1.equals(p2) if p1.firstComponent == p2.firstComponent (these would be int's). Since this is how we overloaded the equals method of the Pair class, we can use the indexOf method of ArrayList to search for a repeated hash code value.  

The program prints a dot for every 1000 Strings processed to apprise you of progress. The program generates a random String, searches to see if its hashCode is already in the list, and if not, adds the Pair (hash code, String) to the list. You probably know that the sequential search used in ArrayList is O(n), and each Pair added to the list increases the list size. Thus we are doing overall 0+1+2+3+...+(N-1) comparisons (in the worst case... and pretty much the worst case rules since we will rarely get the same hash code). The value of N is 100,000 (the loop limit). Calculate this sum, so you have a good idea of how many comparisons are being performed overall:

1+2+3+...+99,999 = ___________________________________ (hint: Don't do this by adding the numbers together on your calculator... there is a formula)

When two different objects (unequal) generate the same hashCode, they are said to collide. Collisions are supposed to be rare. Run the program and write down a couple of Strings that collide. You should expect 0-5 collisions on an average run. If you get something totally different, suspect the equals method of the Pair class. You may need to run the program several times to get answers:

____________ and ______________ both have hash code ________________

____________ and ______________ both have hash code ________________

How long did the program take to run (approximately): _________________ 
(you may cancel execution if it takes over 5 minutes - you should then go find a faster computer)

The type of structure we are using to store integers paired with Strings is very efficiently implemented in the Hashtable class provided in the JCF. This class is optimized to allow fast lookups. Incidentally, the underlying implementation uses hashCodes to store and retrieve Objects. This will only be fast if the hashCodes are really fairly unique. 

We will no longer need our Pair class, since a Hashtable stores key,value pairs already. Modify the program by commenting out the lines on the left, replacing them with the statements on the right.

ArrayList<Pair<String>> h = new ArrayList<Pair<String>>();Hashtable<Integer, String> h = new Hashtable<Integer, String>();
Pair<String> p = new Pair<String>(hv, temp);
int loc = h.indexOf(p);
Object found = null;
if (loc > -1)
  found = h.get(loc);
Object found = something to find and get the string stored by its hash code value (hv) out of the table (h)
h.add(p);

Something to put the string (temp) into the hash table (h) with the hash value (hv) as the key
 

You can see that instead of add, the HashTable uses the method put. This requires two arguments. We are passing an int (the key to the data in the table and the value we will be searching for) and a String. We want to detect duplicate hashCode values. 

Since hashCodes are ints, why didn't we declare the Hashtable like this: Hashtable<int, String> h; ?

Answer:

The (Integer, String) pairs (representing a hashcode of a string, and the string) will be stored and located using the hashCode of the Integer objects. Lookup the hashCode method for class Integer. How does it calculate a hash code?

Answer:

If Integer x = new Integer(2987464), y = new Integer(-3438576); is there any possibility that x.hashCode()==y.hashCode()?

Answer and explanation:

Run the new program. How long does it take to generate 100,000 random Strings, add them to the storage structure, looking each up in the process?

Answer: __________________

Increase the number loop iterations to 500,000. You would not do this with the ArrayList! Why not?

How long does it take now? 

Answer:

The put and get operations are O(1) for a hashtable. Regardless of how large the collection grows, it takes roughly the same amount of time to insert or locate any element.

If you try running the program with 750,000 iterations, you will get an exception. What is it?

Answer:

There is only so much room in memory (the heap) for the Strings we are creating. If each String is seven characters in length, these will use up at least 14 bytes (unicode) plus whatever overhead is needed for the rest of the String object. Let's conservatively estimate each String as needing 50 bytes (this is below what is actually consumed!!!). How much storage is consumed just by the Strings in the program (with 500,000 iterations)?

Answer:

______________________ bytes or about _____________ MB

The default maximum heap size for the Java Virtual Machine is 64 MB. Imagine the additional overhead used by your program. You can see that the number of Strings you are generating is going to approach this limit pretty quickly. You can make the maximum larger by specifying a parameter to the JVM when it is executed. We will just accept the default limit for our experiments.

As the Hashtable grows in size, it is forced to reallocate storage for the elements it contains. This is required for two reasons. First, more data requires more space, and under the wrapping of the Hashtable is an array. As the array gets partially filled, performance degrades, so a larger array must be allocated. Reallocation takes time, but with careful planning, this should not affect performance very dramatically. 

How can you specify a larger initial capacity for a Hashtable?

Answer:

  

Print the two files (Pair and HashCollisions) and attach them to the lab answers. Be sure your files contain your name and the current date. Submit your answers and program solutions in a single stapled package.