Fundamentals of Building a Search Engine using Python

Learning to Crawl

Data Structures (Python Lists) 

The list data type has some more methods. Here are all of the methods of list objects:

Add an item to the end of the list. Equivalent to a[len(a):] = [x].
Extend the list by appending all the items in the given list. Equivalent to a[len(a):] = L.
list.insert(i, x)
Insert an item at a given position. The first argument is the index of the element before which to insert, so a.insert(0, x) inserts at the front of the list, and a.insert(len(a), x) is equivalent to a.append(x).
Remove the first item from the list whose value is x. It is an error if there is no such item.
Remove the item at the given position in the list, and return it. If no index is specified, a.pop() removes and returns the last item in the list. (The square brackets around the i in the method signature denote that the parameter is optional, not that you should type square brackets at that position. You will see this notation frequently in the Python Library Reference.)
Remove all items from the list. Equivalent to del a[:].
Return the index in the list of the first item whose value is x. It is an error if there is no such item.
Return the number of times x appears in the list.
Sort the items of the list in place.
Reverse the elements of the list in place.
Return a shallow copy of the list. Equivalent to a[:].

An example that uses most of the list methods:


>>> a = [66.25, 333, 333, 1, 1234.5]
>>> print(a.count(333), a.count(66.25), a.count('x'))
2 1 0
>>> a.insert(2, -1)
>>> a.append(333)
>>> a
[66.25, 333, -1, 333, 1, 1234.5, 333]
>>> a.index(333)
>>> a.remove(333)
>>> a
[66.25, -1, 333, 1, 1234.5, 333]
>>> a.reverse()
>>> a
[333, 1234.5, 1, 333, -1, 66.25]
>>> a.sort()
>>> a
[-1, 1, 66.25, 333, 333, 1234.5]
>>> a.pop()
>>> a
[-1, 1, 66.25, 333, 333]

You might have noticed that methods like insert, remove or sort that only modify the list have no return value printed – they return the default None.[1] This is a design principle for all mutable data structures in Python.

 Nested Lists

Elements of a list can be any type you want, you can also mix and match different types of elements in a list. For example:
mixed_up = [‘apple’, 3, ‘oranges’, 27, [1, 2, [‘alpha’, ‘beta’]]]
or a more useful example:
beatles = [[‘John’, 1940], [‘Paul’, 1942], [‘George’, 1943], [‘Ringo’, 1940]]
This list provides information about the names of the Beatles band members, as well as when they were born. Try putting this into your interpreter. When you are typing your code into the interpreter and you want to separate data onto two lines, do so after a comma to make it clear to the interpreter that this is still one list.
beatles = [[‘John’, 1940], [‘Paul’, 1942], [‘George’, 1943], [‘Ringo’, 1940]]
print beatles
[[‘John’, 1940], [‘Paul’, 1942], [‘George’, 1943], [‘Ringo’, 1940]] print beatles[3]
[‘Ringo’, 1940]


Lists support mutation. This is the main difference between strings and lists. Lists can be mutated, thus changing the value of an existing list. Here is a list:

p = [‘H’, ‘e’, ‘l’, ‘l’, ‘o’]
Mutate a list by modifying the value of its elements:
p[0] = ‘Y’
This expression replaces the value in position 0 of p with the string ‘Y’. After the assignment, the
value of p has changed:
print p
[‘Y’, ‘e’, ‘l’, ‘l’, ‘o’]
p[4] = ‘!’
print p
[‘Y’, ‘e’, ‘l’, ‘l’, ‘!’]

Now that you know how a mutation modifies an existing list object, you will really be able to see how this is differs from strings when you introduce a new variable.
p = [‘H’, ‘e’, ‘l’, ‘l’, ‘o’]
p[0] = ‘Y’
q = p
After this assignment, p and q refer to the same list: [‘Y’, ‘e’, ‘l’, ‘l’, ‘o’]. Suppose we use an assignment statement to modify one of the elements of q:

### Aliasing
p = [‘H’, ‘e’, ‘l’, ‘l’, ‘o’]
p[0] = ‘Y’
q = p
q[4] = ‘!’
print p

[‘Y’, ‘e’, ‘l’, ‘l’, ‘!’]

After the q = p assignment, the names p and q refer to the same list, so anything we do that mutates that list changes that value both variables refer to. It is called aliasing when there are two names that refer to the same object.


<list>.index(<value>) → <position> or error
The index method is invoked on a list by passing in a value, and the output is the first position where that value sits in the list.

for Statements

The for statement in Python differs a bit from what you may be used to in C or Pascal. Rather than always iterating over an arithmetic progression of numbers (like in Pascal), or giving the user the ability to define both the iteration step and halting condition (as C), Python’s for statement iterates over the items of any sequence (a list or a string), in the order that they appear in the sequence. For example (no pun intended):


>>> # Measure some strings:
... words = ['cat', 'window', 'defenestrate']
>>> for w in words:
...     print(w, len(w))
cat 3
window 6
defenestrate 12

If you need to modify the sequence you are iterating over while inside the loop (for example to duplicate selected items), it is recommended that you first make a copy. Iterating over a sequence does not implicitly make a copy.

More about the crawling process in the next post…


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s