Beware of Collection.retainAll in Java!

Yesterday I got a nasty bug! Basically, I observed that a map – which was populated with about 20 values (no significance of particular number here) on first access (in a static manner) – was missing most of those values when accessed later. Begun the debugging.. and finally the culprit was caught.

Culprit happened to be the function in Java collections:

boolean retainAll(Collection<?> c)

This function retains only the elements contains in the original collection. Basically this gets called over a collection (let’s call it original) and takes as an argument another collection (let’s call it smallset). I will reproduce a simplified version of my code here to explain the issue and how to resolve it:

class A {
 public static void main(String[] args) throws Exception {
 Map<String, Integer> original = new HashMap();
 original.put("Pen", 1);
 original.put("Color", 2);
 original.put("Paper", 3);
 original.put("Envelope", 4);
 original.put("Eraser", 5);
 original.put("Crayon", 6);
 System.out.println("BEFORE: original map size: " 
              + original.size());

 Set<String> smallset = new HashSet();
 smallset.add("Color");
 smallset.add("Crayon");
 System.out.println("BEFORE: smallset set size: " 
             + smallset.size());


 Set<String> originalKeys = original.keySet();
 originalKeys.retainAll(smallset);
 System.out.println("AFTER: original map size: " 
    + original.size());
 System.out.println("AFTER: smallset set size: " 
   + smallset.size());
 }
}

In short, there is an original HashMap which contains six different keys and there is a small set of values with which we want the intersection from original Map. We get the keySet from the original map and call retainAll on the same. Idea is to get the intersection of the map keys and the small set. Output of this program is:

mawasthi@mawasthi-1:~/scratch$ java -cp . A
BEFORE: original map size: 6
BEFORE: smallset set size: 2
AFTER: original map size: 2
AFTER: smallset set size: 2

So you get it – right? Since we used keySet() on Map, since Java returns references and since retainAll has a side-effect of modifying the original map and hence our original map becomes as short as passed small set.

How to fix this?

1) Clone the Map.keySet() output. OR
2) Create a new HashSet and do a addAll of Map.keySet to it.

Idea in (1) and (2) is to create a copy.

May be this is because I’m coming from the C/C++ background (who is in love with the concepts of functional programming aka no side-effects) that I hate the arguments getting modified in any manner within a function.

Advertisements

Understanding unicode in python and writing text in devanagri script

This Unicode HOWTO by Python Software Foundation is a short but informative read about Unicode handling in python. Understanding the brief history of characters standardization from ASCII to Unicode is important not only when you’re working on the stuff that needs it but also since more and more services interact with each other over Web and internationalization (or localization or l10n) is omnipresent.

In python (v2.7) playing around with unicode is super easy.

A normal string in python:

$ s = "abcde"
$ type(s)
str 

A unicode string in python:

$ s = unicode("abcde")
$ type(s)
unicode 

A character is the smallest possible component of a text. ‘A’, ‘B’, ‘C’, etc., are all different characters. So are ‘È’ and ‘Í’. The Unicode standard describes how characters are represented by code points. A code point is an integer value, usually denoted in base 16. In the standard, a code point is written using the notation U+12ca to mean the character with value 0x12ca (4810 decimal). Strictly, these definitions imply that it’s meaningless to say ‘this is character U+12ca’. U+12ca is a code point, which represents some particular character; in this case, it represents the character ‘ETHIOPIC SYLLABLE WI’. In informal contexts, this distinction between code points and characters will sometimes be forgotten.

So, a Unicode string is a sequence of code points, which are numbers from 0 to 0x10ffff.

The rules for translating a Unicode string into a sequence of bytes are called an encoding. UTF-8 is one of the most commonly used encodings. UTF stands for “Unicode Transformation Format”, and the ‘8’ means that 8-bit numbers are used in the encoding. (There’s also a UTF-16 encoding, but it’s less frequently used than UTF-8.) UTF-8 uses the following rules:

  • If the code point is <128, it’s represented by the corresponding byte value.
  • If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.
  • Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.

UTF-8 has several convenient properties:

  • It can handle any Unicode code point.
  • A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can’t handle zero bytes.
  • A string of ASCII text is also valid UTF-8 text.
  • UTF-8 is fairly compact; the majority of code points are turned into two bytes, and values less than 128 occupy only a single byte.

That was some important and interesting theory.

Now out of curiosity I wanted to print Devanagari script characters (Hindi characters to be precise). So I searched over google for Devanagari characters unicode list and found this page from unicode.org comprising of all devanagari script unicode values.

first_name = u'\u092e' + u'\u0928' + u'\u094b' + u'\u091c'
last_name = u'\u0905' + u'\u0935' + u'\u0938' + u'\u094d' \
+ u'\u0925' + u'\u0940'

print first_name + ' ' + last_name

This prints the devanagari characters (basically my name). Output on a ipython notebook is clear and on terminal bit unclear. Looks beautiful.

मनोज अवस्थी