Socialite Connections

Sat 16 September 2017 | tags: python, beautifulsoup, spacy, web extraction, natural language processing, graph theory

One of my favorite projects during my time at The Data Incubator was about people. It was cool to see how insights can be drawn about individuals using information that is already out there on the web.

The Objective

Simply put, the main point of this project was to find out who is the most influential! This piece of information can be important for a variety of reasons. In the advertising/marketing space, for example, it is important to understand who holds the most influence when trying to find an endorser for a product or cause.

The Information Source

New York Social Diary is a website that documents the events these socially influential people attend. They do this with pictures! If you take a look at the event page for one of their recent events, here, they have pictures of the socialites with their names in the captions. More importantly, they keep an entire archive of pictures from these events. The list of events can be found here.

Extraction Part 1.

Pulling the links to the parties.

The first thing to do was to extract all the links to the parties from the list of events.

So if you take a look at a picture from an event, you can see that the captions sit right below the pictures. Looking at the browser's HTML source for the page that lists all the events, you can see that the links to each event are placed in <div class="views-row"> tags, and each link lives in an <a href='link'> tag within <span> tags. A picture illustrating this can be seen here.

To be precise, the href values inside those <a> tags aren't full links themselves but relative paths that point to each event's picture page. We can simply join each path onto the site root, http://www.newyorksocialdiary.com/.

In addition, we notice that there are 30 pages of these event links. Each page can be accessed simply by changing the page number in the http://www.newyorksocialdiary.com/party-pictures?page=1 URL.
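As a rough sketch of what that looks like (the example path below is one of the hrefs pulled out later in this post):

import urlparse

base = 'http://www.newyorksocialdiary.com/'
example_href = '/party-pictures/2017/hope-and-heroes'  # relative path taken from an <a> tag

print urlparse.urljoin(base, example_href)  # full URL for one event's picture page
print base + 'party-pictures?page=5'        # one of the later pages of the event listing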

Using a combination of Python's Requests and BeautifulSoup libraries, extracting these links is easy peasy. Check out the code snippet below to get an idea of how this is done for the first page.

This part pulls the raw HTML from the first page and prettifies it using BeautifulSoup so it's easier to read through.

import requests
from bs4 import BeautifulSoup

url = 'http://www.newyorksocialdiary.com/party-pictures'
page = 0  # the first page of the event listing
events_on_page = requests.get(url, params={"page": page})
soup = BeautifulSoup(events_on_page.text, "lxml")

print soup.prettify()

Next, as we noticed before, the party links live within the <div class="views-row"> tags, so we can use BeautifulSoup's find_all method to extract these elements.

links = soup.find_all('div', attrs={'class':'views-row'})
links

This should return a list of div elements.

[<div class="views-row views-row-1 views-row-odd views-row-first">
<span class="views-field views-field-title"> <span class="field-content"><a href="/party-pictures/2017/hope-and-heroes">Hope and Heroes</a></span> </span>
<span class="views-field views-field-created"> <span class="field-content">Monday, September 11, 2017</span> </span> </div>, ...]

To pull out the event name, the link, and the date from each element, I created a function that utilizes BeautifulSoup's methods.

from datetime import datetime

def get_link_date(el):

    el_title = links[el].select('a')[0]  # the <a> tag holding the event name and href
    url = el_title['href']
    date = links[el].select('span')[3]  # the <span> holding the date text

    return url, datetime.strptime(str(date.text), '%A, %B %d, %Y')

The function builds on the list of links extracted above and selects an element, ‘el’, from that list and parses out the event name, url, and date.
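A quick sanity check, assuming the links list built above is still in scope:

print get_link_date(0)
# ('/party-pictures/2017/hope-and-heroes', datetime.datetime(2017, 9, 11, 0, 0))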

Building everything above into one function:

def get_links(page):

    list_urls_dates = []
    params = {"page": page}
    response = requests.get('http://www.newyorksocialdiary.com/party-pictures', params=params)
    soup = BeautifulSoup(response.text, "lxml")
    links = soup.find_all('div', attrs={'class':'views-row'})
    for i in range(len(links)):
        title = links[i].select('a')[0]
        url = title['href']
        date = links[i].select('span')[3]
        date1 = str(date.text)
        y = datetime.strptime(date1, '%A, %B %d, %Y')
        list_urls_dates.append((url, y))

    return list_urls_dates

Building an overall function that parses out all the urls and event dates from the rest of the pages is trivial, and I'll leave that part to the readers of this post. ☺

Best time to party?

This simple extraction alone can reveal insights! A simple question one can ask is what months had the most parties? Which month/year was most active for these socialites?

All in all, I extracted about 1200 event links and dates; this number will vary depending on when you start/end the extraction.

My list of urls looked like this:

url_dates = [('/party-pictures/2017/hope-and-heroes', datetime.datetime(2017, 9, 11, 0, 0)), ('/party-pictures/2017/maison-de-modecom-endless-summer-trunk-show', datetime.datetime(2017, 9, 1, 0, 0)), ('/party-pictures/2017/women-artists-maestros-and-keepers-of-the-flame', datetime.datetime(2017, 8, 31, 0, 0)),
…]

From this list we can assemble a pandas dataframe to keep a count of the events in each month.

import pandas as pd

def event_count():

    df1 = pd.DataFrame(url_dates, columns=['event', 'date'])  # url_dates is the list built above
    df1['date'] = df1['date'].dt.strftime('%b-%Y')  # bucket each event into a month-year label
    df2 = df1['date'].value_counts()

    values = []
    dates1 = []

    for i in df2.index:
        dates1.append(str(i))

    for i in df2:
        values.append(i)

    counts = zip(dates1, values)

    return counts
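For example, to see the busiest months (a minimal sketch; the exact counts will depend on when you scraped):

month_counts = event_count()
busiest = sorted(month_counts, key=lambda pair: pair[1], reverse=True)
print busiest[:5]  # the five month-year buckets with the most parties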

Extraction Part 2.

Pulling the Picture Captions.

So, time for the fun part! Now that we have all these links, we need to scrape out the captions from each picture on the page.

We can use a similar method to the one above with BeautifulSoup. Once again, if we look into the source code of any event page, the biggest thing we notice is that the photo captions are located in <div align="center" class="photocaption"> tags. Try it for one of the events.

One thing to note is that the tags in which the captions live change over time; older pages (events around 2007) contain the photo captions in different tags. I'll leave the investigation of what those tags are to my readers. ☺

Here's how I parsed some of the event pages:

import urlparse

clean_captions = []

def get_captions(path):

    full_path = urlparse.urljoin('http://www.newyorksocialdiary.com/', path)  # combines the event path with the root url
    url_html = requests.get(full_path)  # requests the full html of the event page
    soup1 = BeautifulSoup(url_html.text, "lxml")  # turns the html text into a soup object
    captions = soup1.find_all('div', attrs={'class':'photocaption'})

    for i in range(len(captions)):  # loops through the caption tags and pulls out the text
        x1 = captions[i].text
        clean_captions.append(x1)  # accumulates the text in the clean_captions list
    return clean_captions

Simply passing the extracted urls from earlier through the function should return captions. Parsing the pages for all the events takes some time.
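Something along these lines, assuming url_dates is the list of (path, date) tuples from the first part:

for path, event_date in url_dates:
    get_captions(path)  # appends each event page's captions to clean_captions

print len(clean_captions)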

After cleaning out the excess spaces, empty elements and the photographer captions, I had a list that looked like the one below.

u' Dr. David Lyden and Patricia Sorenson ',
u' Procession over the eyebrow bridge on way to Painting Gallery',
u"Jon Batiste and Marcus Miller at The NAACP Legal Defense and Educational Fund's 28th annual National Equal Justice Award Dinner.",
u'Valerie and Carl Kempner',
u' Dwight Gooden and Les Lieberman ',
u' Dr. Amy Cunningham-Bussel, Ray Mirra, and Dr. Tyler Janovitz ',
u'CEO Kate Coyne, Robert Liberman, and Libby Monaco',
u' Larry Spangler ',
u'Leah Aden with friends',
u' Honorees Andrew Malcolm and Dan Bena ',
u"The scene at Michael's 25th anniversary celebration.",

As you can see there’s a lot more than just names. There are descriptions of the scene, lots of formal titles like Dr. and CEO, and couples that don’t include the last names for some of the spouses (u'Valerie and Carl Kempner').

This stuff required more cleaning. I accomplished this using the bit of code below. It took lots of tinkering to finally get it right.

import re

nohonorifics_cleancaptions = []
def junk_title_cleaner(unclean_caption):

    h1=['Mr. ','Guest','U.S. Representative ',' M.D.', ' M.D.,','PhD','Ph.D.',' Jr.',' Sr.','Mrs. ','Miss ','Doctor ','Dr. ','Dr ','Chair ','CEO ','the Honorable ','Mayor ','Prince ','Baroness ', 'Princess ', 'Honorees ', 'Honoree',' MD']
    h3=['De', 'Highness ','Museum President ','Chief Curator ','Frick Director ','left ','right ','honoree ','de ','host ','dressed ','Police Commissioner ','Music Director ','Frick Trustee ','Historic Hudson Valley Trustee ', 'Museum President ','Public Theater Artistic Director ','Public Theater Executive Director ','Executive Director ','Cooper Union President ','The Hon. ','Dancing Chair ','Director Emerita ']
    h2=['Hon. ','Lord ','Senator ','Deputy ','Director ','Dean ','Actor ','Actress ',' Esq.', 'Gov ','Governor ','Father ','Congresswoman ','Congressman ', 'Countess ','Awardee ','Chairman ','Commissioner ','Lady ','Ambassador ','President ','CEO ', ' von', ' van']

    hwords = h1 + h2 + h3

    # escape each title so characters like '.' are treated literally in the pattern
    honorifics = '|'.join(re.escape(w) for w in set(hwords))

    x = unclean_caption.replace(", and ", " and ")  # normalizes ', and' to ' and '
    x = re.sub(honorifics, '', x)  # strips the honorifics out of the caption

    notitles_clean_captions = x  # cleaned of all the junk titles and such
    nohonorifics_cleancaptions.append(notitles_clean_captions)

    return nohonorifics_cleancaptions
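A quick illustration on one of the captions from the sample above:

print junk_title_cleaner(u' Dr. Amy Cunningham-Bussel, Ray Mirra, and Dr. Tyler Janovitz ')[-1]
# roughly: u' Amy Cunningham-Bussel, Ray Mirra and Tyler Janovitz '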

After dealing with the honorifics, I went after the pesky couples using regular expressions. After parsing out the ‘and’ in the captions of groups and keeping the ‘and’ in captions that had couples, I separated out the couples. This meant turning u'Valerie and Carl Kempner' to u'Valerie Kempner and Carl Kempner'. This is going to be necessary later.

same_family_regex = r'^[A-Za-z]+\s+and\s+[A-Za-z]+\s+[A-Za-z]+'
testcleancaptions_nocouples = []

### Takes a caption that contains a couple (Mike and Sophie Riddle) and turns it into (Mike Riddle and Sophie Riddle)
def same_family_couples(caption, same_family_regex):

    fixed_caption = caption
    divided_by_commas = caption.split(',')
    for string in divided_by_commas:
        if len(re.findall(same_family_regex, string.strip())) > 0:
            names = string.split(' and ')
            if len(names) == 2:
                first_name = names[0]
                second_full_name = names[1]
                common_last_name = second_full_name.split()[-1]  # the shared family name
                fixed_string = first_name + ' ' + common_last_name + ' and ' + second_full_name
                fixed_caption = fixed_caption.replace(string, fixed_string)
                testcleancaptions_nocouples.append(fixed_caption)

    return testcleancaptions_nocouples
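Trying it on the couple from the sample captions:

print same_family_couples(u'Valerie and Carl Kempner', same_family_regex)[-1]
# u'Valerie Kempner and Carl Kempner'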

Finally after all this cleaning, we can do some natural language processing using spaCy.

This Python package is very interesting. It provides very fast part-of-speech tagging and named entity recognition, which can be used to pick out people's names. It works with Unicode text, so after converting the list back to that form, I ran the following code.

import spacy
nlp = spacy.load('en')

only_names_captions = []

def spacy_name_extraction(unicode_clean_list):
    for i in unicode_clean_list:
        doc1 = nlp(i)
        names = []
        for ent in doc1.ents:
            if ent.label_ == u'PERSON':  # people's names (integer label 346 in this spaCy version)
                x = ent.text.strip()
                x = x.encode('ascii')
                names.append(x)
        only_names_captions.append(names)

    return only_names_captions

This function goes through each element in the cleaned captions and keeps only the entities spaCy tags as PERSON (integer label 346 in this version), i.e. people's names. This essentially removes those descriptive captions that include no names.
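As a quick spot check on a single cleaned caption (the output depends on the spaCy model, so treat the result as approximate):

doc = nlp(u'Valerie Kempner and Carl Kempner')
print [ent.text for ent in doc.ents if ent.label_ == u'PERSON']
# something like [u'Valerie Kempner', u'Carl Kempner']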

The final result should be a super list of the names of every caption from every event picture on the website.

All in all, I ended up with over 80,000 unique names from over 100,000 captions.

That should finish the extraction and cleaning part of this project.

Insights

Now that we have these names, what insights can be drawn from them?

If I were trying to market my product, I would love to get an endorsement from one of these folks. Ideally, everyone should endorse my great product. But since we live in a less than ideal world and I only have a finite amount of cash, I need to find an endorser that would have the most impact; the most popular, popular kid.

So who’s most popular in this social scene? The best way to answer that question is to basically see who appears with the most people in the pictures.

To see this, we need to build a network graph.

Connections

So every time a person appears in a picture with another person, we can call that a connection or link. To build this graph, I used the python library networkx.

To get the pairs of connections out of the captions, we definitely need to do some data organization.

There are some pictures with multiple people in them; we need to reduce those captions into lists of pairs. I did that using the python library itertools.
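For instance, a caption with three names reduces to three pairs (names taken from the sample captions above):

import itertools

caption_names = ['Kate Coyne', 'Robert Liberman', 'Libby Monaco']
print list(itertools.combinations(sorted(caption_names), 2))
# [('Kate Coyne', 'Libby Monaco'), ('Kate Coyne', 'Robert Liberman'), ('Libby Monaco', 'Robert Liberman')]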

The nodes in this network graph are the unique names in the list and an edge is created in the graph every time a person appears in a picture with another person. The weight of those edges increase when the two individuals appear in more photos together.

The code below shows how I created the network.

import itertools
import networkx as nx
from collections import Counter

G = nx.Graph()
G.add_nodes_from(uniquelist) #uniquelist is a list that contains all the unique names that appear in the captions

#reduce each caption's list of names to pairs of co-appearances
edgelists = [sorted(captions) for captions in only_names_captions]
edgelists = [itertools.combinations(captions, 2) for captions in edgelists]
edgelists = [list(captions) for captions in edgelists]
edgelists = sum(edgelists, [])

uniquedges = list(set(edgelists))

#count how many photos each pair appears in together
xedges = Counter(edgelists)
edgecounts = [xedges[namestr] for namestr in uniquedges]

wedges = [0]*len(edgecounts) #wedges represents the weights of the edges
for i in range(len(edgecounts)):
    wedges[i] = (uniquedges[i][0], uniquedges[i][1], edgecounts[i])

G.add_weighted_edges_from(wedges)

degrees = list(G.degree_iter(uniquelist))
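To rank everyone, we can simply sort that degree list (a small sketch; degree_iter is the networkx 1.x API used above):

top_connected = sorted(degrees, key=lambda name_degree: name_degree[1], reverse=True)
print top_connected[:10]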

Degree represents the number of edges connected to a node: the higher the degree, the more connections that individual has. My final top list looked a little like this:

('Jean Shafiroff', 346)
('Gillian Miniter', 309)
('Mark Gilbertson', 298)
('Andrew Saffir', 204)
('Geoffrey Bradfield', 200)
('Alexandra Lebenthal', 196)
('Somers Farkas', 190)
('Debbie Bancroft', 166)
('Jamee Gregory', 166)
…