Monday, March 17

Get started with Scrapy

As usual, I set up the virtual environment and just pip installed the packages: lxml and Scrapy, lxml being a prerequisite. But this time it failed with the following error:


It seems it tried to build lxml against the libxml2 and libxslt packages from Anaconda. Anaconda is a Python distribution I installed on my machine; it makes it easy to install many scientific packages like scipy, matplotlib and pandas. This is strange, because inside a virtual environment the Python distribution should be the one within the environment.

After poking around, I found I had the following lines in my ~/.bash_profile:
# added by Anaconda 1.8.0 installer
export PATH="//anaconda/bin:$PATH"

It looks like this was added automatically when I installed Anaconda, putting Anaconda's bin directory first on PATH and making its Python distribution the default one. So even in my virtual environment, where the Python distribution is supposed to be the virtual one, installing packages still looked at the system default one. After removing the line from the bash profile, it seems fine now.
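
A quick way to see which interpreter a shell really picks up is to ask Python itself. The following is only a minimal sketch: the helper name `path_entries_matching` is made up for this post, and it assumes the Anaconda directory has "anaconda" somewhere in its path, as in the line above.

```python
import os
import sys


def path_entries_matching(path_value, needle):
    """Return the PATH entries that contain `needle` (case-insensitive)."""
    entries = [e for e in path_value.split(os.pathsep) if e]
    return [e for e in entries if needle.lower() in e.lower()]


if __name__ == "__main__":
    # The interpreter that is actually running this script.
    print("interpreter:", sys.executable)
    # Inside an activated virtual environment this points at the environment.
    print("prefix:", sys.prefix)
    # Any PATH entries that still point at Anaconda.
    print("anaconda on PATH:",
          path_entries_matching(os.environ.get("PATH", ""), "anaconda"))
```

If the last line still lists an Anaconda directory inside the virtual environment, the build tools found on PATH may come from there rather than from the environment.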

[TODO]


Monday, March 3

Basic mongodb administration

I recently picked up mongodb for an ongoing project. It is a popular NoSQL database; I prefer to call it a schema-less database, and that is also why I use it: the data that goes into the database does not have to follow any predefined schema like it does in SQL ones. This is perfect for exploratory data analysis, where we constantly have to add/delete items/attributes in the data. Previously I used a SQL database, where I had to rebuild the database every time anything new was added; it was painful.

In this post I include a few commands to quickly get you started with regular mongodb administration.
mongod --dbpath /srv/mongodb/

This starts the mongodb server with a predefined local database path.
mongodump --dbpath /data/db/ --out /data/backup/
This uses mongodump to create a backup of the entire database you point to.
mongorestore --dbpath <database path> <path to the backup>
This restores a backup copy into the database at the given path.
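
If you want to script the backup, the same commands can be assembled from Python. This is just a sketch under my own assumptions: the function names and the dated output folder are not part of mongodb, and the flags are exactly the ones shown above.

```python
import subprocess
from datetime import date


def mongodump_command(dbpath, backup_root, day):
    """Build the mongodump call from this post, with a dated output folder."""
    out_dir = f"{backup_root.rstrip('/')}/{day.isoformat()}/"
    return ["mongodump", "--dbpath", dbpath, "--out", out_dir]


def mongorestore_command(dbpath, backup_dir):
    """Build the matching mongorestore call."""
    return ["mongorestore", "--dbpath", dbpath, backup_dir]


if __name__ == "__main__":
    cmd = mongodump_command("/data/db/", "/data/backup", date.today())
    print(" ".join(cmd))
    # Uncomment to actually run the backup instead of just printing it.
    # subprocess.run(cmd, check=True)
```

Keeping the command as a list of arguments avoids shell quoting problems when a path contains spaces.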

I will update this post as I come across other necessary commands.

Thursday, May 23

Basics of R

Objects - variables

Use c() to create a collection of data and assign it to a variable.

metallicaNames<-c("Lars", "James", "Kirk", "Rob")

Dataframes

A dataframe is similar to a spreadsheet: an object containing several objects. Use data.frame() to combine different objects.

metallica<-data.frame(Name=metallicaNames, Age=metallicaAges)

To add a column to a dataframe

metallica$childAge<-c(12, 12, 4, 6)

To see the column names of a dataframe

names(metallica)

Create a list: a list of separate objects


> metallica2<-list(metallicaNames, metallicaAges)
> metallica2
[[1]]
[1] "Lars"  "James" "Kirk"  "Rob"

[[2]]
[1] 47 47 48 46

> metallica2[1]
[[1]]
[1] "Lars"  "James" "Kirk"  "Rob"

> metallica2[2]
[[1]]
[1] 47 47 48 46

Dates

To create date type objects

birth_data<-as.Date(c("1977-07-03", "1969-05-24", "1973-06-21", "1970-07-16", "1949-10-10", "1983-11-05", "1987-10-08", "1989-09-16", "1973-05-20", "1984-11-12"))

Coding variable

A coding variable is used to indicate which group a participant belongs to, such as a "Tablet" or "Phone" group. First we create a collection of numbers to indicate the different groups, and then turn it into a factor with matching labels using factor().


> job<-c(rep(1, 5), rep(2, 5))
> job
 [1] 1 1 1 1 1 2 2 2 2 2
> job<-factor(job, levels=c(1:2), labels=c("Lecturer", "Student"))
> job
 [1] Lecturer Lecturer Lecturer Lecturer Lecturer Student  Student  Student
 [9] Student  Student
Levels: Lecturer Student

An alternative way to create coding variables

job<-gl(2, 5, labels=c("Lecturer", "Student"))

Importing data

csv -> dataframe
lecturerData2 = read.csv("Lecturer Data.csv", header=TRUE)

dat or txt -> dataframe
lecturerData2<-read.delim("Lecturer Data.dat", header=TRUE)

To navigate to a different directory, use setwd("xx/xx").

Manipulating data

select a part of the data

newDataf <- oldDataf[rows, columns]

lecturerPersonality <- lecturerData[, c("friends", "alcohol", "neurotic")]
lecturerOnly <- lecturerData[lecturerData$job=="Lecturer",]
alcoholPersonality <- lecturerData[lecturerData$alcohol > 10, c("friends", "alcohol", "neurotic")]

stack the dataframe (wide -> long)

select columns to be stacked on top of each other


satisfactionStacked<-stack(satisfactionData, select=c("Satisfaction_Base", "Satisfaction_6_Months", "Satisfaction_12_Months", "Satisfaction_18_Months"))



Sunday, October 7

Measure cache parameters

1. Use gettimeofday()
Note from the unix manual: if you use it in C, you have to write "struct" before "timeval xx". Also, include <sys/time.h> instead of <time.h>, because the former is where struct timeval is defined.

2.

Tuesday, October 2

Step by step https client/server building

Server side with apache, mysql and python scripting on an ec2 ubuntu server 12.04; client side with android 2.3 and its built-in sqlite. Both sides are able to use a secure tls connection.

I will go directly into the topic, starting with the server side. For a basic http server on ec2 using apache, refer here. Note that the environment in that article is actually different: instead of building apache from source, I install it directly with the command apt-get install apache2. It surprisingly takes good care of all the details and works well, at least for now. The configuration differs between the two installations, and I suggest you use apt-get.

When the install is done, apache is already up and running. The default http configuration file is located at /etc/apache2/sites-available/default. If you open http://localhost you should see the "It works" page. This page, index.html, is located in /var/www, which serves as your site: you can put all your html files there, and if your computer has an ip address, others can see your site by accessing that address. An easier way to test whether the server is working is the curl command: curl http://localhost returns the response of the server, which in this case is the default index.html. Curl is handy when you want to test the server's response, since you don't need any other client to fire the request.

Now let's get into tls. I assume you already have the key file and cert file on your server. Put them into /etc/ssl/private and /etc/ssl/certs respectively; these are the default directories where apache looks for key and cert files. Then follow this excellent doc to set up the ssl module for apache. There is a default ssl configuration file you can customize, /etc/apache2/sites-available/default-ssl, which includes the document directory for ssl requests and so on. The default directory is the same as for http connections, which is /var/www. If you put a different index.html in it, you can test with curl -k -3 https://localhost, where -k tells curl to skip verifying the server certificate and -3 forces ssl version 3. This will return that page, so you know you are on an https connection.

Ok, now we have the ssl server working pretty well. There is one more step on the server side: adding a handler to deal with different requests. Right now all we can request is the default page. We want more. In particular, I need a handler that takes a POST request, extracts its data, and puts it into a table in the mysql db on the server.

First, install mysql: apt-get install mysql-server mysql-client. Note that on ec2 you can sudo su directly into root without typing any password, which makes things easier because most of the configuration and commands here need root. Then I install the mysql python interface, python-mysqldb; you could use whatever you like instead, php, etc.
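
To give an idea of the database half of such a handler, here is a minimal sketch of storing key/value pairs in a table. To keep it runnable without a mysql server, it uses python's built-in sqlite3 module as a stand-in for python-mysqldb; the table name kv_pairs and the function name are made up for illustration.

```python
import sqlite3


def store_pairs(conn, pairs):
    """Insert key/value pairs into a (hypothetical) kv_pairs table.

    Returns the total number of rows in the table afterwards.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS kv_pairs (k TEXT NOT NULL, v TEXT NOT NULL)"
    )
    # Placeholders keep client-supplied data out of the SQL text itself.
    conn.executemany(
        "INSERT INTO kv_pairs (k, v) VALUES (?, ?)", list(pairs.items())
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM kv_pairs").fetchone()[0]


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")  # throwaway in-memory database
    print(store_pairs(connection, {"key1": "value1", "key2": "value2"}))
```

With python-mysqldb the overall shape should be similar, but the connection comes from MySQLdb.connect(...), statements go through a cursor, and the placeholders are written %s instead of ?.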

Then we will see how to use a python script to handle http/https requests. We use CGI (Common Gateway Interface); it is a way to make executable files like scripts requestable from the client side. The default directory for cgi scripts is /usr/lib/cgi-bin. Put your scripts there and they should be immediately available for http requests. Here is my echo python script:

Basically, for a POST request with several key/value pairs, it prints the number of pairs and then every pair. It uses a python module called cgi, plus cgitb, another module that enables debugging output. Note that line 9 is necessary because it tells the server and client that the output is valid html text; otherwise the client would probably throw an "invalid response" error. In fact CGI is not the best way to serve script requests; it gets highly unstable when scripts get complex. But it is the easiest way to get going. Now run curl --data "key1=value1&key2=value2" https://localhost/cgi-bin/yourscript.py and it should return the result of the script. Note that --data is how you send a POST request via curl.
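
A minimal sketch of an echo script along these lines (not necessarily the same as the script described above) could look as follows. One deliberate difference: it parses the form data with urllib.parse instead of the cgi module, because cgi has been removed from recent python versions. The function name build_response is my own.

```python
#!/usr/bin/env python3
import html
import os
import sys
from urllib.parse import parse_qsl


def build_response(body):
    """Return the full CGI output for a form-encoded POST body."""
    pairs = parse_qsl(body, keep_blank_values=True)
    lines = [
        "Content-Type: text/html",  # tells the client that html follows
        "",                         # a blank line ends the headers
        "<html><body>",
        f"<p>{len(pairs)} pairs received</p>",
    ]
    # Escape the values so client data cannot inject markup into the page.
    lines += [f"<p>{html.escape(k)} = {html.escape(v)}</p>" for k, v in pairs]
    lines.append("</body></html>")
    return "\n".join(lines)


def main():
    # The web server passes the POST body on stdin and its size in CONTENT_LENGTH.
    length = int(os.environ.get("CONTENT_LENGTH") or 0)
    sys.stdout.write(build_response(sys.stdin.read(length)))


if __name__ == "__main__":
    main()
```

The script has to be executable (chmod +x) for apache to run it from the cgi directory.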

At this point, the server setup has reached a happy ending. Now let's look at the client side. The very first thing to do in your android project is to include the cert file of your server, maybe under /res/raw; it is required for the tls connection. Details can be found here.

Assuming you know how to use HttpClient in android, you should now be able to connect your client to your server. Have fun!






Monday, September 24

Apache http server on ec2

Now I am getting a little bit involved with the server side: I have to set up a server for our projects. While the machine is not ready, I decided to first try setting up the server on the aws ec2 platform, just to get a taste.

I chose Apache because I know nothing about running a server, and it is the name I hear most often when people talk about server stuff. It is not hard to install at all, but I took some wrong turns where some guidance would have been appreciated.

So, the following is a walkthrough for setting up httpd on ec2.

The first package to download is of course apache httpd itself, 2.4.3 to date. Extract it into a folder, say ~/httpd. Note there is a subfolder named ~/httpd/srclib, which will be used later.

Second, we need APR and APR-Util. They are required for the httpd installation, and ec2 doesn't have them (at least the ubuntu server doesn't). Extract them into the subfolders ~/httpd/srclib/apr and ~/httpd/srclib/apr-util respectively. This lets httpd build them along the way if the system does not have them already.

One side note: during the install process the system will probably ask you for the root password, which you don't have if you are using an ec2 instance. Don't worry, just set it with the command sudo passwd root. Set your password and you are good to go.

Before starting the installation, install PCRE (the Perl-Compatible Regular Expressions library). If you are using the same server as me, just type sudo apt-get install libpcre3-dev.

Now do the old trick: ./configure, make, sudo make install. Note: add --with-included-apr to ./configure so that it uses the apr and apr-util sources we prepared in srclib.

The make and install commands will take some time, so relax and waste your time on some stupid videos, like this one, which I quite like.

After installation, use apachectl -k start and apachectl -k stop to test the server. If the install went correctly, then with the server started, curl http://localhost will get you the 'it works' html page, which tells you everything is good. Use locate if you cannot find apachectl.

Thanks for watching. I am talking about the video...

Friday, September 7

http/tls connection in python, android, EC2

I have a side project to build a standard tls package for the team. I've never tried sockets, so I start with python just to get a feel. The following is my own experiment, connecting server side code on EC2 with client side code on my local laptop.

1. simple http
I use the sample code from the official doc. It is really simple. All you need to do other than the code is configure the port for the EC2 instance.

For the instance you are running, configure its security group so that the specific port you want to communicate on is open, 2727 in my case.
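
For reference, a plain socket echo pair in the spirit of the official samples might look like this. It is a sketch rather than the exact code I ran; the function names are my own, and it assumes port 2727 is open in the security group.

```python
import socket

PORT = 2727  # must be open in the EC2 security group


def echo_once(conn):
    """Read one chunk from a connected socket and send it straight back."""
    data = conn.recv(1024)
    if data:
        conn.sendall(data)
    return data


def serve(host="0.0.0.0", port=PORT):
    """Server side (EC2): accept one connection and echo what it sends."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(1)
        conn, _addr = srv.accept()
        with conn:
            echo_once(conn)


def send(host, message, port=PORT):
    """Client side (laptop): send bytes and return the echoed reply."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(message)
        return sock.recv(1024)
```

Run serve() on the instance first, then call send("<public address of the instance>", b"hello") from the laptop; if the call times out, the security group is the first thing to check.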

2. simple https
Things get rough with security. Basically, what I know about https, i.e. http over tls, is that it uses a public key system to secure the communication. The server has a private key, which is known only to itself. It also has a corresponding public key, ready to be distributed to anyone who needs to communicate with it. In order for the other side to trust it, the server has to have its public key certified by a trusted 3rd party, called a Certificate Authority. The same goes for the client side.

However, if we just want a connection between our own server and client, we can generate the keys and certificates ourselves without paying for a CA cert file. This is called a self-signed certificate; it acts as its own root CA certificate.


openssl req -new -x509 -days 365 -nodes -out cert.pem -keyout cert.pem


If you have openssl installed on your computer, you can use it to generate the keys with the command above. In this case, I generate the private key and the certificate in the same file. Then I just copy it to the other side, so both sides use the same keys. This keeps things simple here; in most cases you would want a trusted authority to certify the keys for you.

Then on both sides I use the sample scripts from the official doc again. Note that for an https connection you also have to open the port on the ec2 instance.
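
As a rough sketch of how the cert.pem file plugs into python's ssl module (the helper names are my own, and the code in the official doc may be organized differently):

```python
import ssl


def make_server_context(certfile):
    """TLS context for the server; certfile holds both certificate and key."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # With no separate keyfile, the private key is read from certfile too.
    ctx.load_cert_chain(certfile=certfile)
    return ctx


def make_client_context(cafile=None):
    """TLS context for the client that trusts the certificates in cafile.

    With cafile=None it falls back to the system's default CA store.
    """
    return ssl.create_default_context(cafile=cafile)
```

On the server, wrap the accepted socket with make_server_context("cert.pem").wrap_socket(conn, server_side=True); on the client, use make_client_context("cert.pem").wrap_socket(sock, server_hostname=host). The client context checks the host name by default, so the name in the certificate has to match the host you connect to.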

3. simple https with android
With android things are a bit more complex with certificates. I have this cert.pem file, which is not enough for android. Bouncy Castle encryption is well supported by android, and it is what we are going to use to generate the client side key file.

The first step is to install Bouncy Castle. Note that android uses a different version of it, version 145, not the 146 from the official site. Find it and download it; it is a jar file. Put it in the directory '/usr/libexec/java_home/lib/ext', which on mac should be '/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/lib/ext'. Second, add the following line to the java.security file, also located in the lib folder:


With keytool on your machine, do the following with the cert.pem file:
Now you will have a mykeystore.bks file in the raw directory. I use a der file here because otherwise android returns a 'wrong version of certificate' error. To generate a der file from the pem:


We are almost done here. Just grab any sample code for an https connection in android, using either HttpURLConnection or HttpClient, put the correct password and file name in place, and everything should be fine.