Monday, March 17

Get started with Scrapy

As usual, I set up a virtual environment and used pip to install the packages lxml and Scrapy (lxml is a prerequisite of Scrapy). This time, however, the install failed with the following error:


It seems pip tried to build lxml against the libxml2 and libxslt packages from Anaconda. Anaconda is a Python distribution I had installed on my machine; it makes it easy to install many scientific packages such as scipy, matplotlib and pandas. This is strange, because inside a virtual environment the Python distribution in use should be the one that belongs to the environment.

After poking around, I found the following lines in my ~/.bash_profile:
# added by Anaconda 1.8.0 installer
export PATH="//anaconda/bin:$PATH"

They appear to have been added automatically when I installed Anaconda, which made Anaconda's Python the default one on my PATH. Even inside my virtual environment, where the environment's own Python should be used, installing packages still picked up the system default. After removing the export line from ~/.bash_profile, everything seems fine.
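To check this kind of problem from inside the virtual environment, a small script can show which interpreter is actually running and whether PATH still contains entries from another distribution. This is only a sketch: the helper names path_entries_matching and report are my own, and matching on the string "anaconda" is an assumption based on the path shown above.

```python
import os
import sys


def path_entries_matching(path_value, needle="anaconda", sep=os.pathsep):
    """Return the PATH entries whose text contains `needle` (case-insensitive)."""
    entries = [e for e in path_value.split(sep) if e]
    return [e for e in entries if needle.lower() in e.lower()]


def report():
    # Which interpreter is running, and where its environment lives.
    print("interpreter:", sys.executable)
    print("prefix:     ", sys.prefix)
    # PATH entries that may pull in another distribution's tools and libraries.
    for entry in path_entries_matching(os.environ.get("PATH", "")):
        print("suspicious PATH entry:", entry)


if __name__ == "__main__":
    report()
```

If the interpreter path points inside the virtual environment but a suspicious PATH entry is still listed, build tools from that entry may be the ones pip picks up when compiling packages such as lxml.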

[TODO]


Monday, March 3

Basic MongoDB administration

I recently picked up MongoDB for an ongoing project. It is a popular NoSQL database; I prefer to call it a schema-less database, and that is also why I use it: the data that goes into the database does not have to follow a predefined schema, as it does in SQL databases. This is perfect for exploratory data analysis, where we constantly have to add or delete items and attributes in the data. Previously I used a SQL database, and I had to rebuild the database every time something new was added, which was painful.

In this post I include a few commands to get you started quickly with routine MongoDB administration.

mongod --dbpath /srv/mongodb/
This starts the MongoDB server using the given local database path.

mongodump --dbpath /data/db/ --out /data/backup/
This uses mongodump to create a backup of the entire database you point to.

mongorestore --dbpath <database path> <path to the backup>
This restores a backup copy into the given database.
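For regular backups, the mongodump call above can be wrapped in a small Python script that writes each dump into a dated folder. This is a rough sketch, not a tested backup routine: the function names and the dated-folder layout are my own assumptions, and the flags are simply the ones used above. As far as I know, later MongoDB releases dropped --dbpath from mongodump, so check the flags against your version.

```python
import datetime
import os
import subprocess


def build_mongodump_command(db_path, backup_root, now=None):
    """Build the mongodump argument list, dumping into a dated subfolder."""
    now = now or datetime.datetime.now()
    # e.g. /data/backup/2014-03-03 (folder layout is an assumption)
    out_dir = os.path.join(backup_root, now.strftime("%Y-%m-%d"))
    return ["mongodump", "--dbpath", db_path, "--out", out_dir]


def run_backup(db_path="/data/db/", backup_root="/data/backup/"):
    """Run mongodump and return the output folder; raises if mongodump fails."""
    cmd = build_mongodump_command(db_path, backup_root)
    subprocess.check_call(cmd)
    return cmd[-1]
```

Calling run_backup() needs mongodump on the PATH; build_mongodump_command can be used on its own to inspect the command before running it.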

I will update this post as I come across other useful commands.