SEA-final-project
A movie search engine plus a customized recommendation system
/constants
This directory can be empty, but since the Rotten Tomatoes API has a daily usage limit, we suggest not running the crawler every time.
After crawler
With the files from the following links, you can fire up the servers immediately:
Ready to serve
backup
google drive
Run
0. Run crawler
1. Split data into many partitions
python -m src.reformatter <# of partitions for review> <# of partitions for movie>
python -m src.reformatter 4 5
2. Call prework workers
python -m PreworkWorkers
3. Spin up the file system (fs)
python -m fsStart
4. Prepare data for servers
python -m PrepareFS
# Note: the number of partitions should correspond to the number of backend workers
# Default: (NumSuperFront, NumMaster, NumMovie, NumReview, NumIdx, NumDoc)= (1, 3, 3, 3, 3, 3)
5. Start all the workers
Goal: 1. find ports, 2. fire up all servers
python ./StartAll.py
6. Fire up frontend
Need to install Google App Engine SDK first: https://cloud.google.com/sdk/#Quick_Start
dev_appserver.py --host=localhost --port=8080 frontend
# then the frontend server runs at port 8080
7. Try it (in browser)
Following the example above, open:
http://127.0.0.1:8080/
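Steps 1 and 5 above can be sketched roughly as follows. This is a minimal illustration, not the code in src.reformatter or StartAll.py: the round-robin split and the helper names `partition` and `find_free_ports` are assumptions made for the example.

```python
import socket


def partition(records, num_partitions):
    """Split a list of records into num_partitions buckets, round-robin (assumed scheme)."""
    buckets = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        buckets[i % num_partitions].append(record)
    return buckets


def find_free_ports(count):
    """Ask the OS for `count` currently unused TCP ports on localhost."""
    sockets, ports = [], []
    try:
        for _ in range(count):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
            sockets.append(s)
            ports.append(s.getsockname()[1])
    finally:
        # All sockets stay open until here, so the returned ports are distinct.
        for s in sockets:
            s.close()
    return ports


# Ten placeholder records split into 4 partitions, as in "src.reformatter 4 5".
review_parts = partition(list(range(10)), 4)
# One port per backend server of a kind, e.g. NumMovie = 3.
ports = find_free_ports(3)
```

A port found this way can be taken by another process before the server binds it, so a real launcher would likely retry on failure.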
Structure:
The running HTTP servers are structured as follows:
                    --> classifier_front(?) --> ?
User --> SuperFront --> searchEng_front --> searchEng_worker (including IndexServer*3 and DocumentServer*3)
                    --> recom_front --> recom_worker (including MovieServer*3 and ReviewServer*3)
Recommendation System:
Goal: get the user ID --> check the user log for the review history --> query MovieServer for similar critics --> query ReviewServer for movies sorted by weighted rating
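The flow in this goal can be sketched with in-memory data. Everything here is hypothetical: the dictionaries stand in for the user log, MovieServer and ReviewServer, and the similarity measure (inverse of the mean absolute score difference) is an assumption, not necessarily what the project uses.

```python
from collections import defaultdict

# Hypothetical stand-ins for the user log, MovieServer and ReviewServer data.
USER_LOG = {"user-1": {"m1": 4.0, "m2": 2.0}}   # userID -> {movieID: score}
MOVIE_REVIEWS = {                                # movieID -> {critic: score}
    "m1": {"critic_a": 4.0, "critic_b": 1.0},
    "m2": {"critic_a": 2.5, "critic_b": 5.0},
}
CRITIC_REVIEWS = {                               # critic -> {movieID: score}
    "critic_a": {"m1": 4.0, "m2": 2.5, "m3": 4.5, "m4": 2.0},
    "critic_b": {"m1": 1.0, "m2": 5.0, "m3": 1.0, "m4": 5.0},
}


def similar_critics(history):
    """Weight each critic by how close their scores are to the user's (assumed metric)."""
    diffs = defaultdict(list)
    for movie_id, user_score in history.items():
        for critic, critic_score in MOVIE_REVIEWS.get(movie_id, {}).items():
            diffs[critic].append(abs(user_score - critic_score))
    # A smaller mean absolute difference gives a larger weight in (0, 1].
    return {c: 1.0 / (1.0 + sum(d) / len(d)) for c, d in diffs.items()}


def recommend(user_id):
    """Return unseen movie IDs sorted by critic-weighted average rating, best first."""
    history = USER_LOG[user_id]
    weights = similar_critics(history)
    totals, norms = defaultdict(float), defaultdict(float)
    for critic, weight in weights.items():
        for movie_id, score in CRITIC_REVIEWS.get(critic, {}).items():
            if movie_id in history:
                continue  # skip movies the user has already reviewed
            totals[movie_id] += weight * score
            norms[movie_id] += weight
    ranked = {m: totals[m] / norms[m] for m in totals}
    return sorted(ranked, key=ranked.get, reverse=True)


recommendations = recommend("user-1")
```

In this toy data critic_a agrees with the user far more than critic_b does, so critic_a's favourite unseen movie ranks first.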
Structure and Usage:
recom_front --> MovieServer*3
            --> ReviewServer*3
#recom_front api:
#http://linserv2.cims.nyu.edu:46829/recom?user=UserID (e.g. http://linserv2.cims.nyu.edu:46829/recom?user=d0aa6e9b-676b-428f-9758-65e7c09b38a4)
#MovieServer api:
# http://linserv2.cims.nyu.edu:46831/movie?movieID=MovieIDs (e.g. http://linserv2.cims.nyu.edu:46831/movie?movieID=770802394+770882996+12900+13217+11705+770876740+770710325+771362322+533693794+348462568)
#ReviewServer api:
#http://linserv2.cims.nyu.edu:46834/review?critics=CRITICS (e.g. http://linserv2.cims.nyu.edu:46834/review?critics=Emanuel_Levy+Roger_Ebert)
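A small sketch of building these query URLs programmatically. The helper names are hypothetical, and the host and ports are copied from the examples above; since StartAll.py finds ports at startup, the actual ports will probably differ per run. The sketch uses Python 3's `urllib.parse`; under Python 2 the same function lives at `urllib.urlencode`.

```python
from urllib.parse import urlencode

BASE = "http://linserv2.cims.nyu.edu"  # host taken from the examples above


def recom_url(user_id, port=46829):
    """URL for recom_front: /recom?user=UserID"""
    return "%s:%d/recom?%s" % (BASE, port, urlencode({"user": user_id}))


def movie_url(movie_ids, port=46831):
    """URL for MovieServer: /movie?movieID=ID1+ID2+..."""
    # urlencode encodes the joining spaces as '+', matching the example URLs.
    return "%s:%d/movie?%s" % (BASE, port, urlencode({"movieID": " ".join(movie_ids)}))


def review_url(critics, port=46834):
    """URL for ReviewServer: /review?critics=Critic1+Critic2+..."""
    return "%s:%d/review?%s" % (BASE, port, urlencode({"critics": " ".join(critics)}))


example = review_url(["Emanuel_Levy", "Roger_Ebert"])
```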
Current UserLog is created by:
python ./src/createFakeUserLog.py
# This creates 50 users with unique IDs, each with 20 reviews that give random scores to random movies.
#saved at ../userLog/myUserBook
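For illustration, a fake user log of that shape could be generated roughly like this. It is a sketch, not the contents of createFakeUserLog.py: the function name, the 1 to 5 score scale, the `{userID: {movieID: score}}` layout and the placeholder movie IDs are all assumptions.

```python
import random
import uuid


def create_fake_user_log(movie_ids, num_users=50, reviews_per_user=20, seed=None):
    """Return {userID: {movieID: score}} with random scores on random movies."""
    rng = random.Random(seed)
    log = {}
    for _ in range(num_users):
        user_id = str(uuid.uuid4())  # unique ID, same format as the example user above
        movies = rng.sample(movie_ids, reviews_per_user)  # distinct movies per user
        log[user_id] = {m: rng.randint(1, 5) for m in movies}  # assumed 1-5 scale
    return log


movie_ids = [str(i) for i in range(1000, 1100)]  # placeholder movie IDs
user_log = create_fake_user_log(movie_ids, seed=0)
```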
TomatoCrawler
Goal: fetch data from the Rotten Tomatoes website and save it properly
Now we have:
- 250 movies to search
- 1718 movieIDs returned
# To have tomatoCrawler save Movie_fs, Review_fs, and IDs_fs to the file system:
from src import tomatoCrawler
tomatoCrawler.main2FS()
# Or just have tomatoCrawler save Movie_dict, Review_fs, and IDs_fs to ./constants as pickle files:
tomatoCrawler.main2NormalDict()
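Saving and reloading such a pickle file might look like the sketch below. The helper names and the file name `Movie_dict.pkl` are hypothetical; the crawler's real file names and dictionary layout are not shown here, and the demo writes to a temporary directory rather than ./constants.

```python
import os
import pickle
import tempfile


def save_pickle(obj, name, directory="./constants"):
    """Pickle obj to directory/name, creating the directory if needed."""
    if not os.path.isdir(directory):
        os.makedirs(directory)
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=2)  # protocol 2 is readable by Python 2 and 3
    return path


def load_pickle(name, directory="./constants"):
    """Load a previously pickled object from directory/name."""
    with open(os.path.join(directory, name), "rb") as f:
        return pickle.load(f)


# Round trip in a temporary directory with placeholder data.
demo_dir = tempfile.mkdtemp()
save_pickle({"770802394": {"title": "placeholder"}}, "Movie_dict.pkl", demo_dir)
restored = load_pickle("Movie_dict.pkl", demo_dir)
```

Reloading the cached files this way avoids re-running the crawler against the API's daily limit.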
File System module Usage
Distributed dictionary object
from fs import DisTable
#Creating an object
a = DisTable()
# or
b = DisTable({ 1: 'a', 2: 'b', 3: 'c'})
#Set a key-value pair
a[1] = 'a'
a[2] = 'b'
#Get a value with key
a[1]
#returns 'a'
#Pop operation
a.pop(1)
#returns 'a' and remove (1, 'a') from dictionary
#hasKey operation
a.hasKey(2)
#returns True
a.hasKey(1)
#returns False
#Length property
a.length
#returns 1
#Pretty print of dictionary
print a
#2
# b
'''
key1
value1
value2
...
key2
value1
value2
...
'''
Distributed List
from fs import DisList
#Creating an object
a = DisList()
# or
b = DisList([1, 2, 3, 4])
#Append/Extend a value into list
a.append(1)
a.append(2)
a.extend(3)
a.extend(4)
#Get a value given position
a[0]
#returns 1
a[1]
#returns 2
#Update value to given position
a[1] = 3
print a
#[ 1 3 3 4 ]
#Remove value from list
a.remove(1)
print a
#[ 3 3 4 ]
a.remove(3, globl=True)
print a
#[ 4 ]
#Pop operation
a.pop(1)
#returns the popped value and removes it from the list
#Length property
a.length
#returns 1