Data Science from Nepal
It’s hard to miss “Data Science” or “Big Data” as the two hot topics at present. Wikipedia defines data science in the simplest terms as the science of extracting knowledge from data. The vast potential of data science applications is driving the job market and proportional investment from big companies. Consequently, startups working in the area of data science have been mushrooming around the world.
Nepal hasn’t remained untouched by the growing interest in data science. I have been following a string of startups from Nepal working in data science. These are exciting times for the startup scene in Nepal, and it is encouraging to see people experimenting with data science in their startup ventures. Here I am listing some startups that I have been following:
- Oval Analytics – Your Data Science Partner
Oval Analytics is the brainchild of Hemanta Shrestha and Saurav Dhungana. Oval is perhaps the first technology company in Nepal with the aim of providing data analytics services to local clients in addition to external clients. This is a challenging task given the limited market within the country. Oval Analytics wants to become an important part of the data science community in the country.
- Data Nepal – Nepal Unleashed
DataNepal was a startup with an aim to become the go-to repository for “socio-economic, demographic, environmental, developmental and geospatial data” related to Nepal. The data was mainly collected from the public domain and made available in friendlier formats (JSON, CSV, XML).
- Graph Nepal
Graph Nepal is perhaps the first startup with a focus on data visualization and infographics about local issues. Visualization is a powerful part of conveying a story based on big data analytics.
- Kathmandu Living Labs
- Cloud Factory
Let me know (@sauravrt) about other startups from Nepal working in the area of data science, analytics and visualization. I’d be happy to learn about more of them and add them to my list here.
I recently visited Washington DC, the country’s capital, with my wife and some friends. It was a three-day visit over the Memorial Day weekend. This was my second time in the capital. A combination of perfect weather and good company made this a memorable trip for us.
Travel: Our plan was to drive all the way to DC. In normal traffic it should take us around 8 hours to reach DC from Boston. We rented a car big enough to fit six people, and three of us could share the driving. We also planned to use the Waze app and a Garmin GPS with live traffic to keep an eye on traffic conditions ahead.
Lodging: We decided to try out Airbnb for our stay in DC. After a couple of days of collaborative searching we were able to find a host who would take in six guests. The host had good reviews from previous guests and the place was located behind the US Naval Observatory. So we felt pretty confident about the host and the neighborhood.
Day 0 ( May 23, 2014)
The reservation for the rented car had a pick-up time of 12 pm, but a call to customer service early in the morning confirmed that we could pick up the car earlier. So the three of us who would be driving set off towards Logan airport, where the rental car was located. After a quick negotiation we were able to get a slightly bigger car (a Chevy Suburban instead of a Tahoe). In hindsight we are glad we made that choice. The Suburban was plenty spacious for six people and had enough luggage space too.
We drove back to our apartment, where we loaded our luggage into the car, and by 12 pm we were on the road. Our plan was to head out by 12 pm so that we would reach DC by 9 pm. So we were pretty pleased with our organization. We took I-90 W all the way to Sturbridge and split off onto I-84 to head south. We were worried that we would hit New York City evening rush hour traffic.
My work was presented at the recently concluded IEEE SSP2012 (Michigan). It was a poster titled Approximate Eigenvalue Distribution of a Cylindrically Isotropic Noise Sample Covariance Matrix. This work was done in collaboration with my adviser Prof. John R. Buck and Prof. Kathleen E. Wage from GMU.
Unfortunately, due to my internship commitments I wasn’t able to attend the actual conference. My adviser presented the poster on my behalf.
traceroute is a wonderful tool to analyse network structure. According to its man page,
traceroute tracks the route packets taken from an IP network on their way to a given host. It utilizes the IP protocol’s time to live (TTL) field and attempts to elicit an ICMP TIME_EXCEEDED response from each gateway along the path to the host.
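The TTL trick described in the man page can be sketched with a small simulation. This is only a toy model and not how the real tool is implemented (the real traceroute sends actual probe packets); the router names, the `probe` and `traceroute` functions and the reply labels are all made up for illustration.

```python
def probe(path, ttl):
    """Simulate one probe packet sent with the given TTL along `path`.

    `path` lists the routers in order, followed by the destination host.
    Each router decrements the TTL; the router that brings it to zero
    drops the packet and answers with TIME_EXCEEDED.
    """
    *routers, destination = path
    for router in routers:
        ttl -= 1
        if ttl == 0:
            return router, "TIME_EXCEEDED"
    return destination, "REACHED"


def traceroute(path, max_hops=30):
    """Send probes with TTL = 1, 2, 3, ... until the destination answers."""
    hops = []
    for ttl in range(1, max_hops + 1):
        responder, reply = probe(path, ttl)
        hops.append((ttl, responder, reply))
        if reply == "REACHED":
            break
    return hops


# Hypothetical path, for illustration only.
example_path = ["home-router", "isp-gateway", "backbone", "destination-host"]
for hop in traceroute(example_path):
    print(hop)
```

Each probe reveals exactly one more gateway, which is why the hop numbers in a trace line up with the TTL values used.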
From time to time, I like to run traceroute to explore how I connect to different websites via my ISP’s network. It is especially interesting for me to run traceroute tests from the US to servers hosted in Nepal, as I have some idea about how internet traffic flows in and out of Nepal. Today I’ll present the results of a traceroute test to Nepal Telecom’s (NT) website. Since NT has optical fiber links through to India and beyond, it will be interesting to see which links are used for a packet to get from the US to NT.
- Hops 1 – 9 , the packets are still in US
- The trace starts from my router and goes into the Comcast network and to BOS (Boston) in hop 4, then comes down to NYC (New York City) in hop 7 via routers at Woburn and Needham, MA in hops 5 and 6 respectively
- At NYC, the packet drops off from the Comcast network to L3’s 10 Gigabit Ethernet links. Since NYC is the main landing site for trans-Atlantic optical fiber cables coming ashore on the US east coast, it is expected that a packet going out to Nepal would also follow the same path.
- From NYC, the next hop (9) is to Airtel in India via L3’s 10 Gigabit link
- Among several telecom operators in India, NT has bought the largest bandwidth with Airtel, so it makes sense that the route via Airtel’s network is most viable one.
- At hop 11, the packet reached India. This is evident from the jump in round trip delay to ~400ms which translates to ~ 11,000 km. The fiber landing site is most likely Mumbai, India.
- From there on, the packet enters Nepal at hop 12. The router IP 202.70.x.x belongs to NT. The router at hop 12 is a border gateway router, most probably at Bhairahawa, where most of NT’s connections go through to India.
- Then the packet goes through Butwal to Pokhara (pkr.btw) in hop 13
- From Pokhara, the packet reaches NT’s Intn’l Exchange Bldg at Patan on hop 14. From there on the packet finally reaches the webserver at NT’s central office at Bhadrakali.
This was a traceroute analysis through Comcast network. Next I’ll do the same kind of analysis for trace through Verizon DSL network.
If you are one of those who jumped onto the internet bandwagon in the days of dial-up connections, then you must be familiar with the sequence of sounds the modem made before the connection was established with the ISP. I started my internet surfing days listening to that peculiar sound. Back in those days I had a vague idea that the modem was actually transferring data over the copper pair line that had only been used for voice before that. But I didn’t know what the strange sequence of sounds was.
V.34 is the standard protocol recommended by the ITU for modems operating on legacy copper pairs. V.34 allows up to 33.6 kbit/s bidirectional data transfer. ( Refer to Wikipedia for more )
Today I came across a recording of the V.34 dialup modem startup signalling audio sequence (here) and I decided to take a look at its spectral content. The figure below shows the time-domain and spectrogram plots of ~18s of the signalling sequence. ( The total startup time for a V.34 modem is about 10 – 13s )
Then I looked up the startup signalling sequence for the V.34 protocol and found this paper. Briefly, the startup signalling involves four phases, which can be summarized in the following steps ( focusing on the frequency content of the signalling signals ):
Phase I ( Network interaction )
- A 2100Hz answer tone modulated with a 15Hz sine wave is exchanged. ( The 15Hz modulation is not distinct in the spectrogram, but I will take it on faith from the V.34 specification that it is present )
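One possible reason the 15Hz modulation is hard to see: if it is an amplitude modulation (an assumption on my part, as is the 20% modulation depth and the 8 kHz sample rate used below), it only adds two weak sidebands 15Hz either side of the 2100Hz carrier, and a spectrogram needs a long analysis window to separate them. Here is a minimal sketch that synthesizes such a tone and measures the carrier and sideband power with the Goertzel algorithm over a one second window.

```python
import math

SAMPLE_RATE = 8000   # Hz, assumed telephone-band sampling rate
CARRIER_HZ = 2100    # answer tone frequency
MOD_HZ = 15          # modulating sine wave
MOD_DEPTH = 0.2      # assumed modulation depth, for illustration only


def answer_tone(duration_s):
    """Synthesize an amplitude-modulated answer tone (illustrative model)."""
    n = int(duration_s * SAMPLE_RATE)
    return [
        (1 + MOD_DEPTH * math.sin(2 * math.pi * MOD_HZ * i / SAMPLE_RATE))
        * math.cos(2 * math.pi * CARRIER_HZ * i / SAMPLE_RATE)
        for i in range(n)
    ]


def goertzel_power(samples, freq_hz):
    """Return the squared DFT magnitude at the bin nearest freq_hz."""
    n = len(samples)
    k = round(n * freq_hz / SAMPLE_RATE)
    coeff = 2 * math.cos(2 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


tone = answer_tone(1.0)
carrier_power = goertzel_power(tone, CARRIER_HZ)
upper_sideband_power = goertzel_power(tone, CARRIER_HZ + MOD_HZ)
lower_sideband_power = goertzel_power(tone, CARRIER_HZ - MOD_HZ)
print(10 * math.log10(upper_sideband_power / carrier_power), "dB below carrier")
```

With a one second window the frequency bins are 1Hz apart, so the sidebands at 2085Hz and 2115Hz stand apart from the carrier. A typical short spectrogram window (256 samples at 8 kHz gives bins about 31Hz wide) would smear all three into one line.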
Phase II ( Ranging and probing )
- This phase involves three steps: an initial information exchange [INFO0], probing and ranging, and a second information exchange [INFO2]
- The information exchange is done at 600bps using DPSK modulated FDM tones at 1200Hz and 2400Hz
- Probing is used to estimate the channel characteristics. The probing signal consists of a set of tones 150Hz apart, from 150Hz to 3750Hz. However, the tones at 900, 1200, 1800 and 2400Hz are omitted.
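The tone set used for probing can be written down directly from the description above. The sketch below builds the list of frequencies and a simple sum-of-cosines test signal; the 8 kHz sample rate, equal tone amplitudes and zero phases are my assumptions for illustration, not values taken from the specification.

```python
import math

OMITTED_HZ = {900, 1200, 1800, 2400}

# Tones 150 Hz apart from 150 Hz to 3750 Hz, minus the omitted ones.
probe_tones = [f for f in range(150, 3750 + 1, 150) if f not in OMITTED_HZ]


def probing_signal(duration_s, sample_rate=8000):
    """Sum of equal-amplitude cosines at the probing frequencies (illustrative)."""
    n = int(duration_s * sample_rate)
    return [
        sum(math.cos(2 * math.pi * f * i / sample_rate) for f in probe_tones)
        for i in range(n)
    ]


print(len(probe_tones))   # 21 tones remain
print(probe_tones[:6])    # [150, 300, 450, 600, 750, 1050]
signal = probing_signal(0.01)
```

The gaps at 900, 1200, 1800 and 2400Hz are what make this comb of tones easy to spot in a spectrogram.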
Phase III ( Equalize and training )
- This phase consists of a series of signals transmitted between the calling and the answering modem. The exchange consists of a sequence of scrambled binary 1s for fine tuning of the equalizer and echo canceller, and a repeating 16-bit scrambled sequence indicating the constellation size that will be used during Phase IV. These scrambled sequences are transmitted using a four-point constellation. The scrambled sequence occupies the entire channel bandwidth.
Phase IV (Final duplex training )
- This phase consists of a sequence of scrambled binary 1s using either a 4- or 16-point QAM constellation.
I have tried to identify this sequence of events in the spectrogram above ( Larger version ). All of the signalling sequences listed above can be identified in the spectrogram. There were at least two sets of signalling tones that I could not associate with the V.34 specification.
( I spent a couple of hours this afternoon doing this exercise. Coming more than 10 years after the days of dialup, this was a nice trip down memory lane, and moreover I can now see what was going on behind the scenes every time my modem dialed up to the ISP )
Abstract Algebra was one of the mathematics courses I took in the Fall semester, which has just ended. The complete course has two parts taught over two semesters. I took the first part, which covered some basic Number Theory and mostly Group Theory. As part of the course I had to do a class project, and my choice was RSA: A Public Key Cryptography Algorithm. The strength of RSA is based on the difficulty of factoring large integers, especially those formed as the product of two large primes. The algorithm uses the Number Theory concepts of modular exponentiation and the Euler’s function, and the decryption is based on Euler’s theorem. My objectives were to study the algorithm itself and do a simple implementation.
In public key cryptography, the key has a public part and a private part. The public part is made known to everybody, whereas the private part is kept secret by the receiver ( My PGP public key ). Anyone who intends to send a message to the receiver encrypts the plaintext using the receiver’s public key. Once encrypted using the public key, the ciphertext can only be decrypted using the private key, which is safe with the receiver.
RSA is a public key cryptography algorithm jointly developed by R. Rivest, A. Shamir and L. Adleman, and it was described in a paper in 1978. The name of the algorithm is formed from the first letters of the three authors’ surnames. The algorithm was originally patented by M.I.T. but was released to the public domain in September 2000. The algorithm has three steps: (1) Key generation (2) Encryption (3) Decryption.
The RSA key pair is generated as follows
* Generate a pair of prime numbers $p$ and $q$
* Compute $n = pq$
* Compute the Euler’s function $\phi(n) = (p-1)(q-1)$
* Find an integer $e$ such that $1 < e < \phi(n)$ and $e$ is coprime with $\phi(n)$, i.e. $\gcd(e,\phi(n)) = 1$.
* Find another integer $d$ such that $ed \equiv 1 \pmod{\phi(n)}$. This is determined using the extended Euclidean algorithm, which gives $ed = 1 + k\phi(n)$ where $k$ is some integer.
The public key consists of the pair $(n, e)$ and the private key consists of the pair $(n, d)$.
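The key generation steps can be sketched in a few lines of Python. This is a minimal illustration and not the program from my project: the function names are made up, and the primes and public exponent are tiny textbook values chosen so the numbers are easy to check by hand.

```python
import math


def extended_gcd(a, b):
    """Return (g, x, y) such that a*x + b*y == g == gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = extended_gcd(b, a % b)
    return g, y, x - (a // b) * y


def generate_keypair(p, q, e):
    """Build an RSA key pair from two primes and a chosen public exponent."""
    n = p * q
    phi = (p - 1) * (q - 1)          # Euler's function of n
    if not 1 < e < phi or math.gcd(e, phi) != 1:
        raise ValueError("e must satisfy 1 < e < phi(n) and be coprime with phi(n)")
    _, x, _ = extended_gcd(e, phi)   # e*x + phi*y == 1
    d = x % phi                      # so e*d == 1 (mod phi)
    return (n, e), (n, d)


# Toy values for illustration only; real keys use primes hundreds of digits long.
public_key, private_key = generate_keypair(p=61, q=53, e=17)
print(public_key, private_key)   # (3233, 17) (3233, 2753)
```

Here $\phi(n) = 60 \times 52 = 3120$ and $17 \times 2753 = 46801 = 1 + 15 \times 3120$, so $k = 15$ in the notation above.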
Encryption and Decryption
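The RSA algorithm uses the modular exponentiation operation for both encryption and decryption. The plaintext is first converted to numeric codes before it is encrypted. For instance, the letters in the plaintext are represented as integers, for example ‘a’ = 00, ‘b’ = 01, ..., ‘z’ = 25. Once the plaintext is represented by a numeric code $m$ (with $0 \le m < n$), the ciphertext $c$ is generated as
$$c = m^e \bmod n$$
The receiver decrypts the ciphertext using the modular exponentiation operation with the private key pair $(n, d)$ as
$$m = c^d \bmod n$$
The decryption works as follows:
$$c^d \equiv (m^e)^d = m^{ed} = m^{1 + k\phi(n)} \pmod{n}$$
Now according to Fermat’s Little Theorem, for any integer $x$ and prime number $p$ (which is not a factor of $x$), $x^{p-1} \equiv 1 \pmod{p}$. Also by definition of the Euler’s function, $\phi(n) = (p-1)(q-1)$. Thus
$$m^{ed} = m \cdot \left(m^{p-1}\right)^{k(q-1)} \equiv m \pmod{p}$$
This is true even when $m \equiv 0 \pmod{p}$, since both sides are then congruent to zero. Following a similar argument for the prime number $q$,
$$m^{ed} \equiv m \pmod{q}$$
Combining the above two equations according to the Chinese Remainder Theorem, we get
$$m^{ed} \equiv m \pmod{pq}, \quad \text{i.e.} \quad c^d \bmod n = m$$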
( A complete explanation is available in the original paper )
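Encryption and decryption with the letter codes described above can be sketched as follows. Again this is only an illustration with made-up function names and a toy key pair ($n = 3233$, $e = 17$, $d = 2753$, from $p = 61$, $q = 53$). Encrypting one letter at a time like this is really just a substitution cipher and is not secure; it is done here only to keep the arithmetic visible.

```python
def encode(text):
    """Map lowercase letters to numeric codes: 'a' -> 0, 'b' -> 1, ..., 'z' -> 25."""
    return [ord(ch) - ord("a") for ch in text]


def decode(codes):
    """Inverse of encode()."""
    return "".join(chr(code + ord("a")) for code in codes)


def encrypt(codes, public_key):
    """Apply c = m^e mod n to every numeric code."""
    n, e = public_key
    return [pow(m, e, n) for m in codes]


def decrypt(cipher, private_key):
    """Apply m = c^d mod n to every ciphertext value."""
    n, d = private_key
    return [pow(c, d, n) for c in cipher]


# Toy key pair for illustration only.
public_key = (3233, 17)
private_key = (3233, 2753)

ciphertext = encrypt(encode("nepal"), public_key)
recovered = decode(decrypt(ciphertext, private_key))
print(ciphertext)
print(recovered)
```

Python’s built-in three-argument `pow` performs the modular exponentiation efficiently without ever forming the huge intermediate power.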
As a second part of the project, I implemented a simple version of the RSA algorithm in Python. The program can generate an RSA public and private key pair, encrypt a plaintext string and recover the original message from the ciphertext. The keys generated are eight digits long. The plaintext can be a string ( Roman alphabet only for now, no special characters ). The program can be downloaded here.
We acquired a USRP N210 unit from Ettus Research. The plan was to explore MIMO and radar concepts by implementing simple algorithms on the USRP device. However, we got the device just a few months after its release, and back then the USRP was still plagued with multiple bugs ( including one in the firmware ). It was no easy task to set up a working environment on an Ubuntu system, and we came across several problems during installation. At the time I found only one blog detailing the step-by-step installation ( and workarounds ): http://www.raullen.net/2011/02/20/hello-usrp-n210-how-to-make-usrp-n210-running/. Even with the help of this blog, I was unable to set up the working environment on my Ubuntu machine and unfortunately I had to abandon the project then.
A year later, I have decided I want to get the USRP up and running so I can do some cool stuff besides abstract mathematics. I set out to install the device on a new system ( Ubuntu 10.04 LTS ). It turns out that Ettus has done a good job of providing much more detailed documentation on setting up the N210. Here I have made a list of topics and related links that a newbie may need when starting with the N210 USRP, or with the N-series USRPs from Ettus in general.
The safest / easiest way to setup Gnuradio with UHD environment on Ubuntu is to use the build-gnuradio script:
The N210 has issues with the pre-installed firmware and FPGA code, so both need to be updated before the PC can talk to it. The firmware and FPGA images can be downloaded from here
Front panel LEDs
The LEDs on the front panel can be useful in debugging hardware and software issues. The LEDs reveal the following about the state of the device:
- LED A: transmitting
- LED B: mimo cable link
- LED C: receiving
- LED D: firmware loaded
- LED E: reference lock
- LED F: CPLD loaded