DEVELOPMENT OF IMDbPY ===================== A lot of other information useful to IMDbPY developers are available in the "README.package" file. Sections in this file: * STRUCTURE OF THE IMDbPY PACKAGE * GENERIC DESCRIPTION * HOW TO EXTEND STRUCTURE OF THE IMDbPY PACKAGE =============================== imdb (package) | +-> _compat +-> _exceptions +-> _logging +-> linguistics +-> Movie +-> Person +-> Character +-> Company +-> utils +-> helpers +-> parser (package) | +-> http (package) | | | +-> movieParser | +-> personParser | +-> characterParser | +-> companyParser | +-> searchMovieParser | +-> searchPersonParser | +-> searchCharacterParser | +-> searchCompanyParser | +-> searchKeywordParser | +-> topBottomParser | +-> utils | +-> bsouplxml | | | +-> _bsoup.py | +-> etree.py | +-> html.py | +-> bsoupxpath.py | +-> mobile (package) | +-> sql (package) | +-> dbschema +-> alchemyadapter +-> objectadapter +-> cutils (C module) Description: imdb (package): contains the IMDb function, the IMDbBase class and imports the IMDbError exception class. _compat: compatibility functions and class for some strange environments (internally used). _exceptions: defines the exceptions internally used. _logging: provides the logging facility used by IMDbPY. linguistics: defines some functions and data useful to smartly guess the language of a movie title (internally used). Movie: contains the Movie class, used to describe and manage a movie. Person: contains the Person class, used to describe and manage a person. Character: contains the Character class, used to describe and manage a character. Company: contains the Company, used to describe and manage a company. utils: miscellaneous utilities used by many IMDbPY modules. parser (package): a package containing a package for every data access system implemented. http (package): contains the IMDbHTTPAccessSystem class which is a subclass of the imdb.IMDbBase class; it provides the methods used to retrieve and manage data from the web server (using, in turn, the other modules in the package). It defines methods to get a movie and to search for a title. http.movieParser: parse html strings from the pages on the IMDb web server about a movie; returns dictionaries of {key: value} http.personParser: parse html strings from the pages on the IMDb web server about a person; returns dictionaries. http.characterParser: parse html strings from the pages on the IMDb web server about a character; returns dictionaries. http.companyParser: parse html strings from the pages on the IMDb web server about a company; returns dictionaries. http.searchMovieParser: parse an html string, result of a query for a movie title. http.searchPersonParser: parse an html string, result of a query for a person name. http.searchCharacterParser: parse an html string, result of a query for a character name. http.searchCompanyParser: parse an html string, result of a query for a company name. http.searchKeywordParser: parse an html string, result of a query for a keyword. http.topBottomParser: parse an html string, result of a query for top250 and bottom100 movies. http.utils: miscellaneous utilities used only by the http package. http.bsouplxml (package): adapter to make BeautifulSoup behave like lxml (internally, the API of lxml is always used). http.bsouplxml._bsoup: just a copy of the BeautifulSoup module, so that it's not an external dependency. http.bsouplxml.etree: adapter for the lxml.etree module. http.bsouplxml.html: adapter for the lxml.html module. http.bsouplxml.bsoupxpath: xpath support for beautifulsoup. The parser.sql package manages the access to the data in the SQL database, created with the imdbpy2sql.py script; see the README.sqldb file. The dbschema module contains tables definitions and some useful functions; The alchemyadapter adapts the SQLAlchemy ORM to the internal mechanisms of IMDbPY, and the objectadapter does the same for the SQLObject ORM (internally the API of SQLObject is always used). The cutils module is a C module containing C function to speed up the 'sql' data access system; if it can't be compiled, a set of fall'back functions will be used. The class in the parser.mobile package is a subclass of the one found in parser.http, with some method overridden to be many times faster (from 2 to 20 times); it's useful for systems with slow bandwidth and not much CPU power. The helpers module contains functions and other goodies not directly used by the IMDbPY package, but that can be useful to develop IMDbPY-based programs. GENERIC DESCRIPTION =================== I wanted to stay independent from the source of the data for a given movie/person/character/company, and so the imdb.IMDb function returns an instance of a class that provides specific methods to access a given data source (web server, SQL database, etc.) Unfortunately that means that the movieID in the Movie class, the personID in the Person class and the characterID in the Character class are dependent on the data access system used. So, when a Movie, a Person or a Character object is instantiated, the accessSystem instance variable is set to a string used to identify the used data access system. HOW TO EXTEND ============= To introduce a new data access system, you've to write a new package inside the "parser" package; this new package must provide a subclass of the imdb.IMDb class which must define at least the following methods: _search_movie(title) - to search for a given title; must return a list of (movieID, {movieData}) tuples. _search_episode(title) - to search for a given episode title; must return a list of (movieID, {movieData}) tuples. _search_person(name) - to search for a given name; must return a list of (movieID, {personData}) tuples. _search_character(name) - to search for a given character's name; must return a list of (characterID, {characterData}) tuples. _search_company(name) - to search for a given company's name; must return a list of (companyID, {companyData}) tuples. get_movie_*(movieID) - a set of methods, one for every set of information defined for a Movie object; should return a dictionary with the relative information. This dictionary can contains some optional keys: 'data': must be a dictionary with the movie info. 'titlesRefs': a dictionary of 'movie title': movieObj pairs. 'namesRefs': a dictionary of 'person name': personObj pairs. get_person_*(personID) - a set of methods, one for every set of information defined for a Person object; should return a dictionary with the relative information. get_character_*(characterID) - a set of methods, one for every set of information defined for a character object; should return a dictionary with the relative information. get_company_*(companyID) - a set of methods, one for every set of information defined for a company object; should return a dictionary with the relative information. _get_top_bottom_movies(kind) - kind can be one of 'top' and 'bottom'; returns the related list of movies. _get_keyword(keyword) - return a list of Movie objects with the given keyword. _search_keyword(key) - return a list of keywords similar to the given key. get_imdbMovieID(movieID) - must convert the given movieID to a string representing the imdbID, as used by the IMDb web server (e.g.: '0094226' for Brian De Palma's "The Untouchables"). get_imdbPersonID(personID) - must convert the given personID to a string representing the imdbID, as used by the IMDb web server (e.g.: '0000154' for "Mel Gibson"). get_imdbCharacterID(characterID) - must convert the given characterID to a string representing the imdbID, as used by the IMDb web server (e.g.: '0000001' for "Jesse James"). get_imdbCompanyID(companyID) - must convert the given companyID to a string representing the imdbID, as used by the IMDb web server (e.g.: '0071509' for "Columbia Pictures [us]"). _normalize_movieID(movieID) - must convert the provided movieID in a format suitable for internal use (e.g.: convert a string to a long int). NOTE: as a rule of thumb you _always_ need to provide a way to convert a "string representation of the movieID" into the internally used format, and the internally used format should _always_ be converted to a string, in a way or another. Rationale: a movieID can be passed from the command line, or from a web browser. _normalize_personID(personID) - idem. _normalize_characterID(characterID) - idem. _normalize_companyID(companyID) - idem. _get_real_movieID(movieID) - return the true movieID; useful to handle title aliases. _get_real_personID(personID) - idem. _get_real_characterID(characterID) - idem. _get_real_companyID(companyID) - idem. The class should raise the appropriate exceptions, when needed; IMDbDataAccessError must be raised when you cannot access the resource you need to retrieve movie info or you're unable to do a query (this is _not_ the case when a query returns zero matches: in this situation an empty list must be returned); IMDbParserError should be raised when an error occurred parsing some data. Now you've to modify the imdb.IMDb function so that, when the right data access system is selected with the "accessSystem" parameter, an instance of your newly created class is returned. NOTE: this is a somewhat misleading example: we already have a data access system for sql database (it's called 'sql' and it supports also MySQL, amongst other). Maybe I'll find a better example... E.g.: if you want to call your new data access system "mysql" (meaning that the data are stored in a mysql database), you've to add to the imdb.IMDb function something like: if accessSystem == 'mysql': from parser.mysql import IMDbMysqlAccessSystem return IMDbMysqlAccessSystem(*arguments, **keywords) where "parser.mysql" is the package you've created to access the local installation, and "IMDbMysqlAccessSystem" is the subclass of imdb.IMDbBase. Then it's possibile to use the new data access system like: from imdb import IMDb i = IMDb(accessSystem='mysql') results = i.search_movie('the matrix') print results A specific data access system implementation can defines it's own methods. As an example, the IMDbHTTPAccessSystem that is in the parser.http package defines the method set_proxy() to manage the use a web proxy; you can use it this way: from imdb import IMDb i = IMDb(accessSystem='http') # the 'accessSystem' argument is not # really needed, since "http" is the default. i.set_proxy('http://localhost:8080/') A list of special methods provided by the imdb.IMDbBase subclass, along with their description, is always available calling the get_special_methods() of the IMDb class. E.g.: i = IMDb(accessSystem='http') print i.get_special_methods() will print a dictionary with the format: {'method_name': 'method_description', ...}