Beware of Collection.retainAll in Java!

Yesterday I hit a nasty bug! A map that was populated statically with about 20 values on first access (the exact number is not significant) was missing most of those values when accessed later. I started debugging, and finally the culprit was caught.

The culprit turned out to be this method from the Java Collections framework:

boolean retainAll(Collection<?> c)

This method retains only those elements of the collection it is called on that are also contained in the collection passed as the argument; everything else is removed. In other words, it is called on a collection (let’s call it original) and takes another collection (let’s call it smallset) as its argument. I will reproduce a simplified version of my code here to explain the issue and how to resolve it:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class A {
    public static void main(String[] args) throws Exception {
        Map<String, Integer> original = new HashMap<>();
        original.put("Pen", 1);
        original.put("Color", 2);
        original.put("Paper", 3);
        original.put("Envelope", 4);
        original.put("Eraser", 5);
        original.put("Crayon", 6);
        System.out.println("BEFORE: original map size: "
                + original.size());

        Set<String> smallset = new HashSet<>();
        smallset.add("Color");
        smallset.add("Crayon");
        System.out.println("BEFORE: smallset set size: "
                + smallset.size());

        // keySet() returns a live view backed by the map, not a copy
        Set<String> originalKeys = original.keySet();
        // so removing keys from the view also removes the map entries
        originalKeys.retainAll(smallset);
        System.out.println("AFTER: original map size: "
                + original.size());
        System.out.println("AFTER: smallset set size: "
                + smallset.size());
    }
}

In short, there is an original HashMap with six different keys, and a small set of keys that we want to intersect with the keys of the original map. We get the keySet from the original map and call retainAll on it. The idea is to get the intersection of the map keys and the small set. The output of this program is:

mawasthi@mawasthi-1:~/scratch$ java -cp . A
BEFORE: original map size: 6
BEFORE: smallset set size: 2
AFTER: original map size: 2
AFTER: smallset set size: 2

So you get it, right? keySet() does not return a copy: it returns a view backed by the map itself. retainAll modifies the collection it is called on, so removing keys from that view also removes the corresponding entries from the original map. Hence our original map becomes as small as the small set that was passed in.

How to fix this?

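1) Copy the Map.keySet() output by passing it to a constructor, e.g. new HashSet<String>(original.keySet()). OR
2) Create a new HashSet and do an addAll of Map.keySet() to it.

The idea in both (1) and (2) is to call retainAll on a copy, so that the map itself is left untouched.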

Maybe it is because I come from a C/C++ background (and am in love with the functional programming idea of no side effects) that I hate it when a function modifies the objects handed to it in any manner.


Understanding Unicode in Python and writing text in Devanagari script

This Unicode HOWTO by the Python Software Foundation is a short but informative read about Unicode handling in Python. Understanding the brief history of character standardization from ASCII to Unicode is important not only when you’re working on something that needs it, but also because more and more services interact with each other over the Web, and internationalization (i18n) and localization (l10n) are everywhere.

In Python (v2.7), playing around with Unicode is super easy.

A normal string in Python:

>>> s = "abcde"
>>> type(s)
<type 'str'>

A Unicode string in Python:

>>> s = unicode("abcde")
>>> type(s)
<type 'unicode'>

A character is the smallest possible component of a text. ‘A’, ‘B’, ‘C’, etc., are all different characters. So are ‘È’ and ‘Í’. The Unicode standard describes how characters are represented by code points. A code point is an integer value, usually denoted in base 16. In the standard, a code point is written using the notation U+12ca to mean the character with value 0x12ca (4810 decimal). Strictly, these definitions imply that it’s meaningless to say ‘this is character U+12ca’. U+12ca is a code point, which represents some particular character; in this case, it represents the character ‘ETHIOPIC SYLLABLE WI’. In informal contexts, this distinction between code points and characters will sometimes be forgotten.

So, a Unicode string is a sequence of code points, which are numbers from 0 to 0x10ffff.

The rules for translating a Unicode string into a sequence of bytes are called an encoding. UTF-8 is one of the most commonly used encodings. UTF stands for “Unicode Transformation Format”, and the ‘8’ means that 8-bit numbers are used in the encoding. (There’s also a UTF-16 encoding, but it’s less frequently used than UTF-8.) UTF-8 uses the following rules:

  • If the code point is <128, it’s represented by the corresponding byte value.
  • If the code point is between 128 and 0x7ff, it’s turned into two byte values between 128 and 255.
  • Code points >0x7ff are turned into three- or four-byte sequences, where each byte of the sequence is between 128 and 255.

UTF-8 has several convenient properties:

  • It can handle any Unicode code point.
  • A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can’t handle zero bytes.
  • A string of ASCII text is also valid UTF-8 text.
  • UTF-8 is fairly compact; the majority of commonly used characters are turned into one or two bytes, and values less than 128 occupy only a single byte.
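
The length rules above are easy to check by encoding one character from each range. This is only a small sketch: the sample characters and the helper name utf8_length are my own picks for illustration, and the u'' literals are used so that it runs on Python 2.7 as well as on Python 3.3 or later.

```python
# One sample character per UTF-8 length class (chosen for illustration).
samples = [
    (u'A', 1),           # U+0041, below 128
    (u'\u00e9', 2),      # U+00E9, between 128 and 0x7ff
    (u'\u092e', 3),      # U+092E DEVANAGARI LETTER MA, above 0x7ff
    (u'\U0001f600', 4),  # U+1F600, above 0xffff
]


def utf8_length(text):
    """Return the number of bytes in the UTF-8 encoding of text."""
    return len(text.encode('utf-8'))


for text, expected in samples:
    print('%d byte(s), expected %d' % (utf8_length(text), expected))

# ASCII text is already valid UTF-8: the encoded bytes are unchanged.
ascii_unchanged = u'abcde'.encode('utf-8') == b'abcde'
print(ascii_unchanged)
```

Devanagari letters fall in the three-byte range, so Hindi text takes roughly three bytes per character in UTF-8.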

That was some important and interesting theory.

Now, out of curiosity, I wanted to print Devanagari script characters (Hindi characters, to be precise). So I searched Google for a list of Devanagari Unicode characters and found this page from unicode.org listing the code points of the whole Devanagari script.

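# letters ma, na, vowel sign o, letter ja
first_name = u'\u092e' + u'\u0928' + u'\u094b' + u'\u091c'
# letters a, va, sa, virama (joins sa and tha), letter tha, vowel sign ii
last_name = u'\u0905' + u'\u0935' + u'\u0938' + u'\u094d' \
+ u'\u0925' + u'\u0940'

print first_name + ' ' + last_name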

This prints my name in Devanagari characters. The output renders cleanly in an IPython notebook and a bit less clearly on a terminal. Looks beautiful.

मनोज अवस्थी

Python Error “TypeError: 'float' object is not callable”

If you see this error in Python:

TypeError: 'float' object is not callable

It means you are putting parentheses after something that is a float, i.e. trying to call a number. Most often there is a method or function and a property or variable with the same name in the script, and the variable has taken over the name. Check for that. If reusing the name was a mistake, fix it by renaming the variable. If the property is what you wanted to access and you did not mean to call any function, remove the parentheses `()` after the property.
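
Here is a minimal way to reproduce the error. The function name average and the numbers are made up for illustration; the mistake is storing a result under the function's own name.

```python
def average(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / float(len(values))


# Hypothetical mistake: the result is stored under the function's own name,
# so from here on 'average' refers to the float 2.0, not to the function.
average = average([1, 2, 3])

try:
    average([4, 5, 6])  # this now tries to call a float
except TypeError as error:
    message = str(error)

print(message)  # 'float' object is not callable
```

Renaming the variable (for example to mean_value, again a made-up name) keeps the function callable.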

 

 

 

java.lang.NoSuchMethodError while running Mahout example

Yesterday I was setting up Apache Mahout for work-related stuff. After downloading and unpacking it, I built and installed it with the command below (versions 0.7, 0.8 and 0.9 [trunk] all gave the same issue that I am addressing in this post):

mvn -DskipTests clean install

After setting up some environment variables like MAHOUT_HOME, I was ready to run some machine learning algorithms on some data. I chose the Quick Start Guide on the Mahout site. I downloaded the sample data and copied it to HDFS on Hadoop (Cloudera distribution 4.3.0, cdh4) as instructed on the quick start page. After this, I ran the k-means algorithm on the sample data in HDFS:

$MAHOUT_HOME/bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

It failed with the following error:

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.hadoop.util.ProgramDriver.driver([Ljava/lang/String;)V
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:175)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:192)

I searched on Google and found a great many results, all pointing to some kind of version mismatch between Hadoop and Mahout, some issue with CLASSPATH, or something else. I could not really resolve the issue, since I was not free to change the Hadoop version and I was getting this error with Mahout 0.7, 0.8 and svn trunk (0.9).

Finally, I ran the algorithm like this (replacing `bin/mahout` with `hadoop -jar apache-mahout-version-job.jar.. `)

$HADOOP_HOME/bin/hadoop -jar $MAHOUT_HOME/examples/target/apache-mahout-0.7-job.jar org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

And this worked.

It looks like a classpath issue. A NoSuchMethodError means the class was found but the loaded version does not have a method with the expected signature, so most likely a ProgramDriver from a different Hadoop version than the one Mahout was built against was being picked up. I didn’t make much progress resolving the classpath problem itself, but I was happy to get the map-reduce job running and producing output, which I finally copied back to the local file system.

BTW, this alternative is equivalent to running the mahout shell script, so you don’t miss anything.

Also note that if you are using Mahout version 0.7, there is a bug in the script $MAHOUT_HOME/bin/mahout at around line 224. Read this:

> bin/mahout throws NoClassDefFoundError: org/apache/hadoop/util/ProgramDriver
> I think this is due to line 224
> CLASSPATH="${CLASSPATH}:${MAHOUT_HOME/lib/hadoop/*}"
> I fix the issue by moving the closing brace
> CLASSPATH="${CLASSPATH}:${MAHOUT_HOME}/lib/hadoop/*"

You can fix it locally. This issue is fixed in 0.8.

Writing “Hello World” program!

Sometimes we get questions that are utterly silly but that we have to attempt because the situation demands it (interviews, presentations, a professor’s lecture..). One of those silly questions (of almost zero importance and usefulness in life) is “How many ways can you write a ‘Hello, World!’ program in C/C++?”.

For those who want a specification of the program: the task is to write a program that prints the string “Hello, World!” on the console. Well, there are lots and lots of answers:

#include <stdio.h>
int main() { printf("Hello, World!\n"); } /* C Example */

#include <iostream>
int main() { std::cout << "Hello, World!" << std::endl; } // C++ Example

#include <stdio.h>
int main() { if(printf("Hello, World!\n")) {} } // C; without semicolon

#include <stdio.h>
int main() { while(!printf("Hello, World!\n")) {} } // C; without semicolon

The following one, which I came across while re-reading Stroustrup, writes the program using iterators (Chapter 3; Iterators and I/O). It is nothing new, just an iterator-based version of the C++ example above.. but yes, different clothing! It needs the <iostream>, <iterator> and <string> headers and using namespace std;.

ostream_iterator<string> oo(cout);
int main() { *oo = "Hello, "; ++oo; *oo = "World!\n"; }

Modifying built-in classes

Inheritance is a powerful idea in object oriented programming. This lets you add new enhancements to existing classes. Well, not exactly. You have to create a new data type derived from the existing classes and use these for additional enhancements.

For example, suppose we want to add a function palindrome? to the String class. The solution available in all (most?) object oriented programming languages is to create a class, say Word, derived from String and add the function to this class. In Ruby, for example, it would look like:

class Word < String
  def palindrome?
    self == self.reverse
  end
end
...
irb> w = Word.new("level")
irb> w.palindrome?
true
irb> w1 = Word.new("simple")
irb> w1.palindrome?
false

What we lose here is that we CANNOT call the function on plain String objects; e.g., the following fails:

 
irb> "level".palindrome? 
NoMethodError: undefined method `palindrome?' for "level":String
 

Ruby, quite surprisingly (and did I mention it never fails to amaze!), provides a way around this. Its classes are open, so it lets you modify even the built-in ones. We can add the function simply by doing the following:

 
 
class String 
def palindrome? 
   self == self.reverse 
end
end 
 

and it makes your String more powerful!

 
 
irb> "level".palindrome? 
true
 

Isn’t that coooooool??

Ruby require error in loading gems

It has been a long time since I coded in Ruby, so I thought, let’s rekindle the love.

I started reading about the Graph APIs from Facebook to get some idea of what capabilities they provide. The Graph APIs are immensely powerful in the kind of data they allow developers to access. No wonder such a powerful ecosystem got created around them. Graph API responses are in JSON (JavaScript Object Notation). I have little familiarity with this representation of data; coming from the old school, I’ve mostly played with XML. So I thought, let’s first practice JSON in Ruby. So, first step:

$ gem install json

It was easily installed.

Then I wrote a script that loads the JSON gem, i.e.

require 'json'

and just to test if this is fine I executed it (being almost sure that it will and I will move on). Ruby interpreter yelled!

$ ruby temp.rb
temp.rb:1:in `require': no such file to load -- json (LoadError)
from temp.rb:1

Hmm. So, once again, some googling helped (and refreshed some memory).

There are three ways to handle this:

1. Put the following line at the start of the script

require 'rubygems'

2. Execute it as:

$ ruby -rubygems script.rb

3. Add rubygems to RUBYOPT

$ export RUBYOPT="rubygems"

Well, just now I read this article on Github and it clearly says:

You should never do this in a source file included with your library,
app, or tests:

require 'rubygems'

Why You Shouldn’t Force Rubygems On People!

So guys, (2) and (3) are the way to go and (1) is to be avoided if you plan to share your Ruby script.

Facebook open sources its C++ library

Facebook has announced that it is open-sourcing its C++ library named folly, and has made it available via GitHub.

The library is a collection of C++11 components and is claimed to be highly usable and fast. In fact, their introduction page focuses particularly on the performance side of the library. Folly has been tested with gcc 4.6 on 64-bit installations.

I just downloaded the library for a quick look and it should be interesting to peek into the code.

You can also have a look here: https://github.com/facebook/folly/blob/master/folly/docs/Overview.md