Modularity

In your exit tickets from a few weeks ago, one of you picked up on a stray comment about how the modularity algorithm relies on randomness to identify communities in a network dataset. This is why, even though you were all working with the very same data, some of you found Emily and Tiffany in the same community after running the modularity algorithm in Gephi, whereas others found them in separate communities. There are many ways of calculating modularity, and I don’t actually know how it has been implemented in Gephi. Modularity, according to Mark Newman, is defined as “the number of edges falling within groups minus the expected number in an equivalent network with edges placed at random.” So you see that a) there is a calculation that has to occur; b) the minuend, or first number, describes the edges in our actual dataset; whereas c) the subtrahend, or number being subtracted, accounts for the same nodes and the same node degrees as our dataset but with a random distribution of edges. See Scott Weingart’s “Networks Demystified 5: Communities, PageRank, and Sampling Caveats” for a helpful, non-mathy explanation of modularity. See the Wikipedia entry for a slightly more mathy description.
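To make the subtraction concrete, here is a minimal sketch in R of one common formulation of Newman's modularity, normalized by twice the number of edges. The toy network (two triangles joined by a single edge), the function name modularity_q, and both community assignments are made up for illustration. This is not a description of how Gephi implements modularity.

```r
# Toy undirected network: two triangles joined by one edge (made-up data)
edges <- matrix(c(1, 2,  1, 3,  2, 3,
                  4, 5,  4, 6,  5, 6,
                  3, 4), ncol = 2, byrow = TRUE)
n <- max(edges)

# Build a symmetric adjacency matrix from the edge list
A <- matrix(0, n, n)
A[edges] <- 1
A[edges[, 2:1]] <- 1

# Hypothetical helper: observed within-group edges minus the number expected
# if edges were placed at random while keeping every node's degree
modularity_q <- function(A, membership) {
  k <- rowSums(A)                          # node degrees
  two_m <- sum(k)                          # twice the number of edges
  expected <- outer(k, k) / two_m          # expected edges under random placement
  same_group <- outer(membership, membership, "==")
  sum((A - expected)[same_group]) / two_m
}

q_triangles <- modularity_q(A, c(1, 1, 1, 2, 2, 2))  # each triangle is a community
q_arbitrary <- modularity_q(A, c(1, 2, 1, 2, 1, 2))  # an arbitrary split
```

On this toy network the two-triangle split scores about 0.36 and the arbitrary split about -0.21. A community detection algorithm searches for the assignment with the highest score, and when that search involves random choices, two runs can settle on different assignments.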

Literary Genres

Download this file as: [1] an R Markdown file; [2] a PDF.

An exercise to accompany the Underwood reading

In “The Life Cycles of Genres,” Underwood uses predictive modeling to find a computational method of reproducing historical judgments about literary genre. The question he is testing is: “Do different observers [of literary genre] actually agree?” Underwood’s model classifies genre using the “bag of words” technique, which is to say it identifies vocabulary that is distinctive to one or more genres. As the markers of historical judgment, he uses Library of Congress genre/form headings, applied by catalogers to describe the contents of books, together with thematic bibliographies and other authoritative sources. You can see LCGFH in action in our own library catalog, here, for example; note the clickable subject headings in the “More on this subject” box on the lower right.
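If “bag of words” sounds abstract, the sketch below shows the basic representation in base R: each text is reduced to word counts, with word order thrown away. The two snippets are made-up stand-ins, not texts from Underwood’s corpus, and this shows only the counting step, not his predictive model.

```r
# Two made-up snippets standing in for novels (not from Underwood's corpus)
texts <- c(
  detective = "the detective found the clue and the detective smiled",
  gothic    = "the castle was dark and the ghost walked the castle"
)

# Split each text into lowercase words; word order no longer matters
tokens <- strsplit(tolower(texts), "[^a-z]+")

# Shared vocabulary across both texts
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per text, one column per word, cells are counts
dtm <- t(sapply(tokens, function(words) table(factor(words, levels = vocab))))

# Words that appear in the "detective" snippet but never in the "gothic" one
distinctive <- colnames(dtm)[dtm["detective", ] > 0 & dtm["gothic", ] == 0]
```

A classifier trained on a table like dtm, with one row per novel, has only these counts to work with. That is why distinctive vocabulary ends up carrying the genre signal.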

In this exercise, we’re going to examine Underwood’s methods both by (re)reading his article and by looking at some of the data he collected to build his argument. It is still somewhat unusual for humanities scholars to publish their data, so here we have a rare opportunity to pick apart the building blocks of an author’s argument. To download the data and scripts associated with Underwood’s article, go to http://dx.doi.org/10.7910/DVN/XKQOQM. I suggest that you save and unzip the file to your desktop so that you can follow the portions of this article that use the R language without the need to modify any file paths. If you’re doing this exercise for credit, please email me your responses to the questions by Friday, April 7, 2017.

  1. Take a look at the Appendix A: Metadata at the end of Underwood’s article in Cultural Analytics. This appendix contains the explanations of the genre tags used by the author. Then, open the file named readme.md in the main folder of the data/code download; it will open with any text editor such as TextEdit (Mac) or Notepad (PC). This is where Underwood provides an overview of his research project and describes all the component files of his data and code. From his readme.md you can tell that the folder meta has the metadata (or data about the novels) for his research project. Open the file called finalmeta.csv with a spreadsheet application, find 1-2 novels that you have read, and identify the genre tags that the author has applied. What, if anything, do the genre tags tell you about the place of the novel as understood by literary critics? Do the genre tags make sense to you? Had you situated the novel differently, in terms of genre or type? If so, how?
  2. What is the explanation for why the author applied different genre tags that ostensibly describe the same genre, i.e. “det100” and “locdetective” both are used to identify the genre “detective fiction”?
  3. Under the “Detective Fiction” heading of the Cultural Analytics article, take a look at Figure 2 and skim the accompanying text. In looking at the areas of annotation in Figure 2 (e.g. circles, arrows, names of works and authors), describe two ways in which Underwood’s model succeeded, and two ways in which it failed in predicting the “detective fiction” genre. Can you hypothesize some reasons for its success or failure? For a good reference work to look up authors, such as the less well known Anna Maria Hall, I recommend the Literature Resource Center.
  4. Returning to the data and code, open the folder called lexicon and examine the file called new10k.csv in a spreadsheet application. The readme.md file explains that this is a word frequency list for the entire text corpus analyzed in the article. Choose 1-2 words among the most frequently used, and 1-2 among the least frequently used, and form a hypothesis that explores some possible reasons for those frequencies. For example, what are some reasons that “eyes” might be so often mentioned (raw freq: 943), and yet “eyelashes” are not (raw freq: 184)? The high and low frequency terms do not have to be related, and keep in mind that there are no right or wrong answers. Consider this a starting point to a research question that you might like to explore!
  5. Scan Eric Holscher’s “A beginner’s guide to writing documentation.” What aspects of code and data reuse, as outlined in Holscher’s guide, does Underwood’s documentation of his research project address?

The next few questions are optional, and rely on R and RStudio being installed on your computer. You can find instructions for installing R for your operating system at CRAN, and for installing RStudio Desktop at RStudio’s website. Or, you can request an account on https://apps.rutgers.edu/. This is a virtual environment with both R and RStudio pre-installed.

As a first step to working with data in RStudio, you need to set your working directory – using the setwd() command – to the folder with the scripts and data you plan to use. Save this R Markdown file, literary-genres.Rmd, to the expanded folder called tedunderwood-fiction-38d238c for these code chunks to work. To clarify, you should have on your desktop a folder called tedunderwood-fiction-38d238c with these contents:

Click the green arrow to the right to run each code chunk. Alternatively, set your cursor on the line you wish to run, and press ⌘ (command) + return (Mac) or Ctrl + Enter (PC) to run just that line of code.

# Set working directory 
setwd("~/Desktop/tedunderwood-fiction-38d238c") 

# What's my working directory?
getwd()

# install packages, if they're not already on your system
install.packages(c("dplyr", "knitr", "tidyr", "ggplot2"))

# load libraries and data
library(dplyr)
library(knitr)
library(tidyr)
library(ggplot2)

genres <- read.csv("meta/finalmeta.csv", stringsAsFactors = FALSE)
  1. Click on the genres data frame in your Environment panel. Examine the column headers. There are several variables that Underwood did not specifically address in his article. For example, [author] nationality. What is the profile of his text corpus, viewed from the lens of nationality? Do you feel that the authors included in the dataset are representative of the international character of the genres under examination? Otherwise put, to what degree is “detective fiction” a distinctly Anglo-American phenomenon? Or “Gothic”? Or “science fiction”? Check out Literature Resource Center if you’re not sure.
# Author nationality, arranged in descending order by frequency
genres %>% 
  select(nationality) %>% 
  group_by(nationality) %>%
  summarize(nat_count = n()) %>% 
  arrange(desc(nat_count)) %>% 
  kable()
  1. How do you feel Underwood did with the author gender distribution in his dataset? Does it change over time?
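gender <- genres %>% 
  filter(!is.na(gender) & gender != "" & gender != 'us') %>% # get rid of blanks (NA or empty strings) and the 'us' value
  select(firstpub, gender) %>% # select only those columns we're interested in
  group_by(firstpub, gender) %>% 
  summarize(total_by_gender = n()) %>% 
  ungroup() %>% # drop the leftover grouping by year before reshaping
  spread(gender, total_by_gender, fill = 0) # one column per gender; a year with no authors of a gender gets 0 rather than NA, so m + f in the plot is always defined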
 
# plot male and female authors over year of first publication
ggplot(gender) + 
  geom_point(aes(x=firstpub, y=f), color="red", shape=0) +  # women authors in red
  geom_point(aes(x=firstpub, y=m), color="blue", shape=2) + # male authors in blue
  geom_smooth(aes(x=firstpub, y=m+f), color="black") + # a trend line showing both genders together
  labs(
    title = "Gender of Author",
    x = "Year of First Publication",
    y = "Number of authors by gender"
  )


Network Analysis Gephi Lab

I found the Gephi software tool to be very interesting to use. It is a great tool for creating visualizations of network data. In this particular case we used data based on the preferences of each student in the class. Using Gephi we can create charts to analyze data in a neat and organized manner.

 

 

Everything can be separated into different color coded sections making it easy to tell what is connected with what. This makes it easy for us to understand the data and make connections as to why the data is what it is. It’s also interesting to see how the size of each node and their connections to each person are important to the understanding of each person’s answers.

(Unfortunately I had some technical difficulties when turning on labels as my screen seemed like it was glitching out).

 

 

By visualizing nodes and what connects them we can see how everything comes together. This lab also helped me to obtain a better overall understanding of correlations in data networks by giving me hands-on experience.

While I have not tried everything that Gephi had to offer, I still found that Gephi is great at collecting information and demonstrating the links between each subject. I enjoyed using Gephi as a new way of looking at data.

Gephi lab

In this lab, we explored how to visually represent data in a network format. We used the Gephi software. Regarding data sets, we surveyed the class about various questions including each individual’s favorite book or favorite movie. We were able to filter information by specifying what data went into the nodes and edges categories of the software program.

Nothing failed in this lab but it was difficult to initially navigate the software because of visualization constraints. The color bubbles are very effective in displaying categories and relationships, but the text and zoom features are more difficult to navigate thus making the software harder to use. Therefore, my struggles were software-related and it would be unlikely for me to use it again for these reasons. Additionally, it is hard to make both generalizations and general statements about the data since our sample size was so small (6). I could say “most of us” liked something but that could merely mean 4 individuals liked something; in a real world setting, this is a tiny sample size. It is therefore difficult to measure relationships and patterns.

Despite difficulties in describing large trends and patterns, there were some similarities of preferences from the data. For example, a relative center to the data was “some type of ethnic food” which means that many individuals of the 6 would choose some type of ethnic food as a food option. This node is shown as bigger than the other nodes, is relatively central to the network visualization, and is connected to many individuals by edges; therefore, we can infer that “some type of ethnic food” is preferred by many individuals in the class. Other relatively central nodes are the “Zimmerli Art Museum” and “flying,” which show that many individuals preferred these options and are therefore linked through this. Some partial outliers are “mind-reading” and “the most expensive place in town” which indicate that only one individual preferred these options, as they are only connected to one individual. Additionally, the nodes that no one preferred are also included as unconnected outliers.

After completing the reading, I found it interesting that I could physically see and understand the nodes and edges when they were applied to fairly easy-to-understand data. I had somewhat of an idea of what nodes are from my international relations classes, but it was interesting to see nodes in the setting of a network graph like this one, not a tree diagram. This lab further explained the functionality, applicability, and effectiveness of network style data representations, and it reinforced the readings by giving a hands-on activity and a visual to the knowledge in the article.

This image shows the relationship between preferences of the individuals in the Data Mining Byrne class. The larger text and larger bubbles indicate more centrality in the term since it is connected to more individuals and more individuals prefer the items in the larger nodes. The outliers depict the non-centrality of the nodes since only one individual has this preference. What is useful with this type of visual depiction is the usage of color. I believe this distinguishes these types of network charts from many others because each individual is assigned their own color and this allows for the visual to be more easily viewed. I can now easily see what each individual prefers since they have their own color, and I can see where this preference overlaps with the preferences of another. Color is used in a simple yet effective way, and this really distinguishes this type of network chart from other visual data depictions.  

Lab #2: Gephi Networking

 

Completed network

I really enjoyed using Gephi as a network creation and visualization tool. It was pretty simple to use, although I didn’t really play around with all the different settings. I did like that Gephi allows for so many customizations, from simple visuals like the colors of the nodes to customizations more related to the given dataset, such as sizing the nodes in relation to their frequency.

Network with names and responses nodes, and color-coded based on responses

I thought the above visualization was the most interesting, because it shows all the results, but also color-codes based on which nodes tended to correlate with each other. I do wish Gephi recognized the lengths of the text fields on the nodes, and could organize the network so it all fit better visually, but it wasn’t too difficult to drag to position some of the nodes myself.

From this visualization, it seems that Tiffany and I had so many responses in common that we are the same color node. I also shared at least one other response node with every other person in the class, except for with Jack.

Network with names and responses nodes, with color

This network, above, is an extremely basic one, but I didn’t change the positions of the nodes like I did in the previous visualization, so you can see which nodes are shared among the six of us, and which ones were unique. The nodes that branch off our larger “name” nodes away from the center are our unique responses. The ones all piled up in the middle are ones that aren’t unique to one person.

I thought this was a cool lab, and I wish we had more time to explore all of the different options (such as the Layout menu; it’d be interesting to see what other built-in methods for layout Gephi offers). But Gephi is definitely a powerful tool to help visualize and find correlations in sets of data, which can be useful for many different applications.

I think that Gephi would definitely be useful in analyzing correlations in large sets of data; for example, if there were more questions, or more people being surveyed. I can imagine that this would be good for visualizing data from social media, to see what users like or enjoy. I think of the example from this week’s reading, which focused on Facebook as a social media example. Running statistical tests on the data of Facebook users might produce lists of users and their interests, but I wonder if Gephi or similar programs could help to visualize these lists. It might not be effective on such large data sets (with so many names!) but maybe if condensed into locations where users live, we can see what interests they have. There are a lot of potential uses of Gephi and similar programs to let people see the data in a visually pleasing network.

Lab #2

I was really interested by the lesson on networks, despite all of my tech problems. The software showed us data in another way – we hadn’t seen anything like it before and I don’t think we’re going to see anything like it again in the sense of what it does. I think that this was tied much closer to the reading than the last lab was tied to that reading and that was kind of cool – seeing it in effectively real time. The reading had a lot on nodes and edges and it was all vital to the Gephi application.

 

I couldn’t get the software to work as well as basically every other student in the class so you’ll have to excuse me for being a bit lighter on screenshots than the rest of my compatriots. I did try after class to set it up again but I didn’t get anywhere and this past week has been rough so unfortunately I couldn’t swing by office hours.

 

Back on topic, I really liked seeing the data and thus the connections visualized for us – it was a nice riff on the idea of an interconnected world and I did appreciate it despite the fact my computer clearly did not as evidenced by the massive amount of tech screw-ups I had in an hour long period – it probably earned an award.

 

Depending on the kind of network, not only were the nodes and the edges different, the way they actually hooked up to each other was leagues apart from one another. I only have one example but

 

Depending on the message one wants to send, all you realistically need to do is change the one you use – this actually reminded me of a lesson we had drilled into our heads during AP Statistics during my senior year of high school – facts and stats can lie and anyone with even a rudimentary knowledge of either one can lie with them. This also demonstrated to me how easily one can lie with data – just connect the nodes in very different ways.

 

What do I mean by that? In class, you gave us a number of ways to try out and I’ll post the only screenshot I got, from one of the only couple of times this worked for me.

 

I’m just going to compare 2 – the first is category and the second is attribute. It’s the same data both times but please note that the colors change. It looks very similar but because of that one change, people would start to think there are great differences.

 

 

 

Over all, despite all my tech problems I really enjoyed this lab and hope we do more with Gephi as this kind of data visualization is new and I hope that we don’t wind it down now.

Network Analysis Recap

This week’s lesson on networks was really intriguing. Usually, we read about opinions and lessons, but don’t actually get to see them in action. Scott Weingart’s blog posts explaining networks provided good insight into the fact that networks can be applied to many, if not all, situations. He also described how networks are composed of nodes (items, stuff) and edges (arcs, links, ties). It was cool to see that Gephi also utilized the terms “nodes” and “edges” in its program, because it was a direct connection to the readings that we did.

One of the coolest things about Gephi was that you can alter the visualization of the graphs based on what you want to look for. So essentially, even though all the graphs display similar data and results, you can visualize them in different ways to get different understandings.

This network is one of the most basic visualizations of the data set. It represents the two kinds of nodes, students and response types, as well as the edges, which are the connections between the nodes. It’s clear to see who the students are and what the response types are based on the color coordination.

This is a multi-mode network in which the extraneous edges and nodes have been removed. It contains a left matrix of preference-person and a right matrix of person-preference, making the final network a representation of preference-preference. While interesting to see, it can be difficult to decipher with the amount of edges there are visible.

This network demonstrates the relationships between the students based on the number of connections they share in terms of their answers to the various questions. As evident, I have a pretty strong connection with Emily and Divya and some of us have connections to almost everyone while some, like Jack, only have connections to a few people. I thought this network was fairly interesting because it simplified all the data that was collected into a nice yet highly informative network in broad terms. You don’t necessarily need to know what everyone’s answers to each of the questions were, but what you can take from all this data is the similarities and differences among all the students.

This network is also really cool to look at in my opinion. It provides all of the data in one network, from the students (color coded) to all the survey answers INCLUDING the ones that were not chosen. The orientation also has been adjusted to make all the data clear, with the less chosen answers on the exterior and the more popular answers closer to the middle. The size of each node also represents the number of students who chose the answer, since it can be difficult to decipher and count the edges.

It was particularly helpful to work with data that we participated in. Sometimes, looking at other data and networks is interesting, but can be not meaningful when we’re just looking at it for observation. While we were also just looking at this data for observation, it was still intriguing because it showed common interests as well as differences among students and certain communities that may come about as a result of that.

Lab #2: Network Analysis

Download this lab as a PDF.

Introduction

Today’s lab is based on Miriam Posner’s assignment on network visualization, with additions from various other humanities and social science network experts, including Martin Grandjean and Clément Levallois.

We’ll use a free application called Gephi, which you can download and install on your laptops. Gephi is a powerful tool for network analysis, but it can be a bit overwhelming at first. It has a lot of tools for statistical analysis of network data, most of which we won’t be exploring in this introductory lab.

Check for Updates

Before we get started today, I am going to ask you to update your version of Gephi. Go to Help > Check for Updates. Even if you just installed the software, there could be some updates to third party plugins to install. Next, go to Tools > Plugins, navigate to the Available Plugins panel, and search for the Multimode Networks Transformations plugin. Check the box under “Install,” and then press the Install button underneath. Gephi will prompt you to restart the application.

Getting Oriented

Let’s first get familiar with Gephi by looking at one of the sample datasets. Go to Window > Welcome and click on the Les Miserables.gexf file. This is an example of a character co-appearance network, based on Victor Hugo’s Les Misérables. You’ll see an Import Report telling you that this is an undirected graph with 77 nodes and 254 edges. All looks good, so click OK. Click on the Data Laboratory panel in the upper navigation menu and have a look at the data table containing 77 nodes. You’ll see column headers called Id, Label, Interval, and Modularity Class. Someone has already either run a modularity algorithm on this dataset, or they coded the data manually. Modularity is a useful form of “community detection.” More on that here. Communities will have dense connections in between nodes, and sparse connections with nodes in other communities, very much like friend or family groups.

In the upper lefthand corner, next to nodes, you should see a tab called edges. Click on it. This is the edge list, also known as the connections or relationships in between the nodes. Examine the column headers: Source, Target, Type, and so on. You can flip back and forth between your edges and nodes tables to see which character in Les Misérables the id values under Source and Target refer to. We’ll note that this is an undirected network, meaning that the relationships are symmetrical. In other words, Javert talks to Valjean, but it could just as truthfully be said that Valjean talks to Javert.
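The flipping back and forth between the two tables is really a lookup, and it can be sketched in a few lines of R. The column names Id, Label, Source, Target, and Type come from the Data Laboratory tables. The Id values, the third character, and the helper function neighbors_of are made up for illustration and will not match the actual file.

```r
# Toy nodes and edges tables shaped like Gephi's (Id values are made up)
nodes <- data.frame(Id = c(0, 1, 2),
                    Label = c("Valjean", "Javert", "Cosette"),
                    stringsAsFactors = FALSE)
edges <- data.frame(Source = c(0, 0),
                    Target = c(1, 2),
                    Type = "Undirected",
                    stringsAsFactors = FALSE)

# Replace the numeric ids with the labels they refer to
edges$SourceLabel <- nodes$Label[match(edges$Source, nodes$Id)]
edges$TargetLabel <- nodes$Label[match(edges$Target, nodes$Id)]

# Hypothetical helper: in an undirected network a tie counts for both
# endpoints, whichever column a character happens to be listed in
neighbors_of <- function(label) {
  c(edges$TargetLabel[edges$SourceLabel == label],
    edges$SourceLabel[edges$TargetLabel == label])
}

neighbors_of("Javert")   # Javert appears only as a Target, yet is tied to Valjean
neighbors_of("Valjean")
```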

In the upper lefthand side of the navigation menu, click on the Overview panel. Nice, but what on earth are we looking at?? Click on the T (show node labels) on the lower navigation menu to display the node labels. Aha! Next, click on the Preview panel. This is where you would go after you’ve got everything looking just the way you want it in the Overview panel. Click the Refresh button. Your labels will have dropped out, but you can add them again by clicking on Show labels in the menu to your left and clicking on Refresh. You may find that the Proportional size option impairs the readability of the graph. If so, uncheck it, and adjust the font manually to something like 24 pt, and click Refresh once more. Notice that you can export your graph in a variety of formats. I personally find that exporting as a PDF leaves me with the most flexibility for later reuse.

Importing a Dataset

Go to File > Close Project (don’t save). Download the class-nodes.csv and class-edges-undirected.csv files to your desktop. I’ve also provided you with the raw data from the Google Form for reference: class-data.csv. Now we’re going to import our own dataset: go to File > Create New Project. Navigate over to the Data Laboratory (central panel), and click on Import Spreadsheet. Click on the button with the three dots to select class-nodes.csv. Be sure you choose Nodes table from the box that allows you to choose between an edges table and a nodes table. Finally, click Next to move on to the next screen. Make sure the box next to Force nodes to be created as new ones is checked, and click Finish.

Next, click on Import Spreadsheet once more. This time, when you click on the button with the three dots, choose the class-edges-undirected.csv file. Make sure you choose Edges table from the box that allows you to choose between an edges table and a nodes table. Click Next. In the following window, be sure that the box next to Create missing nodes is left unchecked. Click Finish.

To review, the Data Laboratory is where you can manipulate the data you’ve uploaded. If you click on the Nodes or Edges tab, you can toggle between the two spreadsheets.

Start Visualizing

OK, we can finally start visualizing. Click on Overview to go to the panel that will show your network graph.

 

You might be looking at something that looks a bit like a clump of hair somebody left in the shower drain. Huh. Not very exciting just yet, but be patient. Use the scroll wheel to zoom in and out.

  1. Use the hand icon to move the diagram around.
  2. Turn labels on by clicking the T.
  3. Adjust the size of the labels with the scrubber.

What are we looking at? This is a bimodal network graph, meaning it contains two different kinds of things: students and their preferences. Each student is connected to his or her preferences by an edge. It’s still a bit of a mess, though.

Size Nodes

Let’s give nodes a size proportional to their degree (sum of connections). In the Ranking panel of the left column (top), select “Nodes” and the “Size” icon (looks a bit like a layer cake turned on its side), then select “Degree” in the dropdown menu and enter the minimal and maximal value (try 10-100). Click “Apply.”
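For the curious, degree is easy to compute by hand from an edge list. Here is a small sketch in R with made-up ties between placeholder students and preferences. The rescale helper stretches degree linearly onto the 10-100 range, which is an assumption about how Gephi interpolates between the minimum and maximum size, not something I have checked.

```r
# Made-up student-preference ties (placeholder names, not the class data)
edges <- data.frame(
  Source = c("Student A", "Student A", "Student B", "Student C"),
  Target = c("flying", "mind-reading", "flying", "flying"),
  stringsAsFactors = FALSE
)

# Degree: how many edges touch each node, whichever column it sits in
counts <- table(c(edges$Source, edges$Target))
degree <- setNames(as.integer(counts), names(counts))

# Hypothetical helper: map the smallest degree to 10 and the largest to 100
rescale <- function(x, to_min = 10, to_max = 100) {
  to_min + (x - min(x)) / (max(x) - min(x)) * (to_max - to_min)
}
node_size <- rescale(degree)
```

Here “flying” has the highest degree (3) and so gets the largest size, while nodes with a single tie get the smallest.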

Spatialization

Let’s put a little space in between those nodes. In the Layout panel, choose the Fruchterman-Reingold algorithm with the following settings:

  • Area: 20,000
  • Gravity: 10
  • Speed: 10

Fruchterman-Reingold is a force-directed layout algorithm that arranges nodes on the screen by attraction and repulsion, like magnets: connected nodes pull toward each other, while all nodes push each other apart. Click on Run, then Stop once the nodes are sufficiently spaced.

Then, try another layout algorithm: the Force Atlas 2. This one will disperse groups and put space around larger nodes. Be mindful that the parameters you enter can hugely alter the final appearance. I suggest that you check the box next to “prevent overlap” and change “Scaling” to 200. Let the function run until the graph is mostly stabilized.

Style and Centrality

Now let’s add some color so we can distinguish between students and their preferences. In the upper left-hand portion of the screen, click on the palette icon (color). Underneath that, you’ll see two tabs: Nodes and Edges. Select Nodes. Within the Nodes tab, you’ll see three additional tabs: Unique, Partition, and Ranking. Be sure that the Partition tab is selected. Then, from the dropdown menu, select category. Click Apply. Now you can distinguish between the students and their preferences! What observations can we make about the class at this point?

We’ll try a second way of color coding our network graph. In the Partition dropdown menu, change the selection from category to attribute. Click on the “Palette” hyperlink in order to create a custom palette for our nodes. Select the “all grey” option, and click Apply. Next, scroll down in your attributes until you see “m” (male) and “f” (female). We’re going to, you guessed it, color code the class by gender. Click on the squares next to “m” and “f” and choose a unique color for each attribute. Click Apply once more.

Let’s add some more information to our graph by giving the nodes new attributes. Go to the Statistics panel on the righthand side of your Gephi window and click the Run button next to Avg. Path Length. Then close the Graph Distance Report that pops up. This algorithm measures the average graph distance between all pairs of nodes. Along the way, it also calculates centrality measures (betweenness, closeness, and eccentricity) and stores them as new node attributes.
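To see what “average graph distance” means, here is a minimal sketch in base R that counts the steps between every pair of nodes in a toy four-node chain and averages them. The network and the helper steps_from are made up for illustration, and Gephi’s own implementation may differ in its details (for example, in how it treats pairs of nodes that cannot reach each other).

```r
# Toy undirected network: a chain 1 - 2 - 3 - 4 (made-up data)
edges <- matrix(c(1, 2,  2, 3,  3, 4), ncol = 2, byrow = TRUE)
n <- max(edges)
A <- matrix(0, n, n)
A[edges] <- 1
A[edges[, 2:1]] <- 1

# Hypothetical helper: breadth-first search giving the number of steps
# from one start node to every other node (NA if unreachable)
steps_from <- function(A, start) {
  dist <- rep(NA_integer_, nrow(A))
  dist[start] <- 0L
  frontier <- start
  while (length(frontier) > 0) {
    # unvisited nodes adjacent to anything in the current frontier
    reached <- which(colSums(A[frontier, , drop = FALSE]) > 0 & is.na(dist))
    dist[reached] <- dist[frontier[1]] + 1L
    frontier <- reached
  }
  dist
}

# Distance table for all pairs, then the average over each pair counted once
all_steps <- t(sapply(seq_len(n), function(i) steps_from(A, i)))
avg_path_length <- mean(all_steps[upper.tri(all_steps)], na.rm = TRUE)
```

For this chain the six pairwise distances are 1, 2, 3, 1, 2, and 1, so the average works out to 10/6, or roughly 1.67 steps.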

Centrality measures are difficult to grasp at first. Levallois’s slide on the subject might help.

 

Click on the size (sideways layer cake) icon. Go to the Ranking panel of the nodes tab. You will notice when you click the dropdown menu that you have some new options for sizing your nodes. This is an undirected graph, meaning that the edges don’t have any directionality. Try scaling the nodes by In-Degree, then Out-Degree. If your labels get a bit snarled, then go to the Layout panel, select the “Label Adjust” algorithm, and click run. Can you understand what each degree measures? Is one more meaningful than the other? Why? Leave the nodes sized according to the ranking of your choice (Degree, InDegree or OutDegree).

Modularity

Let’s see if we can identify clusters of students who have things in common. To do this, we’ll calculate modularity. On the Statistics pane (at the right of your screen), click on the Run button that appears next to Modularity. In the next popup window, click OK, then close the Modularity Report when it pops up. Now that we’ve calculated modularity, we can color nodes according to their communities. To do that, go to the palette icon (Color), Nodes tab, and Partition pane. From the dropdown window, select Modularity Class. Finally, click Apply. Now we can see which students’ preferences link them together into communities. Students and/or preferences with closer (or more) ties are shaded in the same color. What, if anything, does this data visualization mean to you?

Projection to One-Mode Graph

Use the MultiMode Networks Projection panel, which should be lurking to the right of the Statistics panel. First, go to File > Save and save your network, since this next step will overwrite your data. Click on “load attributes.” You’ll now “project” the preferences onto the students: if two preferences are each linked by an edge to the same student, those preferences will now have a direct edge between them (and the student names will disappear). Select category as the attribute type, and set the matrix as proposed here:

  • preference-person
  • person-preference

They must be symmetric with the type of node you want to keep at the beginning and the end (preferences).

The Monopartite Graph of Preferences

Check the “Remove Edges” and “Remove Nodes” buttons, in order to clean the graph of the unselected nodes and edges. And finally click on Run. This produces a preference-preference graph. Be mindful that you just overwrote your original network data. If you need to take a step backwards, close the project without saving it, and reopen your saved project.

With this 1-mode flattening of our class network, notice how we can finally go to the Ranking pane of the Edges tab, and color our edges by Weight. There are lots of ways to assign weight to an edge, but the simplest expression is the sum of edges. Click on the checkered square to choose a color ramp that will make the stronger edges more visible in your final display. Then, in the Preview panel, go to the left sidebar under Edges and click Color > Original. Click Refresh to get your selected color ramp to display. Now you can see more clearly the preferences that you hold in common, as well as those preferences that were outliers.
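To make the projection and the edge weights a little less mysterious, here is a minimal sketch of the idea in Python. The students and preferences are invented, and the function name is my own. I am assuming the simplest weighting, where the weight of a preference-preference edge is the number of students who share both preferences. The plugin's actual calculation may differ.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical two-mode data: each student and the preferences they listed.
student_prefs = {
    "Emily": ["Casablanca", "Jazz"],
    "Tiffany": ["Casablanca", "Jazz", "Hiking"],
    "Jordan": ["Jazz", "Hiking"],
}

def project_onto_preferences(student_to_prefs):
    """Build a one-mode preference-preference graph.

    Two preferences are linked whenever at least one student holds both;
    the edge weight counts how many students hold both.
    """
    weights = defaultdict(int)
    for prefs in student_to_prefs.values():
        # Sort so that each pair is always stored in the same order.
        for a, b in combinations(sorted(set(prefs)), 2):
            weights[(a, b)] += 1
    return dict(weights)

for (a, b), weight in sorted(project_onto_preferences(student_prefs).items()):
    print(a, b, weight)
```

The student names drop out of the result entirely, which mirrors what happens in Gephi when the unselected nodes are removed.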

And with that, you have created a social network graph of our class. Congratulations! Note that you can also do the reverse: project the students onto the preferences. That said, given that you all answered the same number of questions about your preferences, it seems doubtful that this method would create the most meaningful visualization.

Save and Share

You can save your Gephi graph as a Gephi file, or export it as a .gexf file, so you can open it up again later and edit it. You can take a screenshot from the Overview panel (click on the tiny camera). You can also click on the Preview pane to see a somewhat nicer presentation of your network diagram, and you can change the look of it on the left-hand side of that pane. (Be sure to click Refresh after each change.) Once you’re happy, click on the SVG/PDF/PNG button to export it as an image file.

Write a Lab Report

Finally, write your second and last Lab Report about what you learned in this lab on networks. I haven’t given you much choice in the matter of dataset, so focus specifically on the questions of the relationships, trends or patterns observed, aspects of the methodology (network analysis) or tasks (algorithms) that you found interesting or perplexing, and reasons why, and any connections to the readings assigned for today.

Other Network Applications?

After this baptism of fire, you may find other network applications – like Palladio (graph view) and Cytoscape – relatively intuitive to use. If you are learning to code in R, you might also want to explore the igraph and ggnet packages, among others. Dr. Ognyanova of the School of Communication has created excellent network analysis tutorials in R.

Where to Next?

How might you restructure the data you collected in the first lab, Data Capture, so that you could visualize the nodes (Twitter users) and the edges (the direct replies and/or mentions) using Gephi? What could it help clarify that you didn’t already know from the TAGSExplorer visualization (another network graph)? Take a peek at Clément Levallois’s datasets for guidance on Twitter data formatting.
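As one possible starting point, here is a short Python sketch that turns tweets into an edge list of mentions. Everything about the input is an assumption: the column names from_user and text are placeholders for whatever your archive actually calls them, the sample rows are invented, and the mention pattern is a simplification. The output uses Source, Target and Type as column headers, which is my understanding of what Gephi's spreadsheet import looks for in an edges table; check this against your own version.

```python
import csv
import io
import re

# Simplified pattern for @mentions; real usernames may need stricter rules.
MENTION = re.compile(r"@(\w+)")

# Hypothetical rows. The keys "from_user" and "text" are assumed column names.
tweets = [
    {"from_user": "alice", "text": "@bob have you seen this? cc @carol"},
    {"from_user": "bob", "text": "@alice yes, thanks"},
    {"from_user": "carol", "text": "no mentions here"},
]

def mention_edges(rows):
    """Return one (source, target) pair per mention, skipping self-mentions."""
    edges = []
    for row in rows:
        source = row["from_user"].lower()
        for target in MENTION.findall(row["text"]):
            target = target.lower()
            if target != source:
                edges.append((source, target))
    return edges

def to_edge_csv(edges):
    """Serialize the edges as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Type"])
    for source, target in edges:
        writer.writerow([source, target, "Directed"])
    return buffer.getvalue()

print(to_edge_csv(mention_edges(tweets)))
```

Unlike our class network, a mention has a clear direction (who mentioned whom), so here In-Degree and Out-Degree would measure different things.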

What about analyzing and visualizing other social media data as a network? Here’s a way of approaching Instagram hashtags:

Here’s an interesting tutorial on using the semantic web to query Wikipedia and visualize the results as a network.


  1. There’s an argument to be made that our class network should be directed. It may be more correct to say students like films, as opposed to suggesting a symmetrical relationship, as in film -> student is equal to student -> film. We’re treating this as an undirected network, however, if only because the Multimode Networks Projection plugin can’t seem to handle directed networks.

Lab1

I chose to explore the popularity and ranging biases of Twitter users on the subject of the recent executive order known as the Muslim ban (or travel ban). I selected recent tweets with the keywords “Muslim ban” in the tweet. I added a “retweet count” and “favorite count” filter to see the popularity of the individual tweets. This screen capture is representative of the relationships between tweets regarding the Muslim ban. The “reply” feature of TAGSExplorer illustrates when one user replies to another user or group of users. This allows us researchers to analyze who the popular Twitter figures are, to see who generates the most replies, and to look for potential anomalies. It is important to analyze such relationships because then we as researchers are able to see a visual map of the impact of each tweet and Twitter user. We can conceptualize through this map a network of interactions.

 

This activity was very relevant to the “Ferguson: digital protest, hashtag ethnography, and the racial politics of social media in the United States” article because it was interesting to see how relevant and popular Twitter is with regard to current events. However, Twitter, like all social media, is a platform on which people spread their biased opinions. All opinions have bias and all social media allow for these opinions to be spread. It can become chaotic when there are controversial events and many tweets are being tweeted, which is something I learned from this lab. Additionally, my data was searched through the term “Muslim ban”; however, this term is in itself a generally liberal perspective on the executive order. The executive order is also known as the travel ban, and this likely represents a different set of opinions on the issue. Selecting a controversial topic to research and analyse seems to pose these problems because there is inherent dispute in the name of the ban. Beyond this, there are differing opinions on the details, constitutionality, and legality of the ban, which heavily skews the analysis of the tweets. With such controversy and debate, exploring the general direction of the popular opinion is difficult; however, this can also be attributed to the incredible political divisiveness at the moment. Despite such problems, one can still explore what Twitter users believe and what sources they use to back up controversial issues. Twitter brings attention to what is happening daily, minute by minute. Twitter serves as a platform for people to speak their political opinion, for those who don’t feel comfortable doing so in public. Although this is another form of bias, it allows us researchers to see more opinions than we would if we did an in-person poll, for example. Utilizing data capture engines such as TAGS lets us explore what we wouldn’t necessarily know through other means.

After analyzing these data, there is no conclusive and general statement I can make as there is so much division on the matter. However, it is important to read the data and see how figures such as “@realdonaldtrump” are trending and evoking many responses. It is important to observe and analyze what is trending and happening now, as it crucially supplements the news and current events by adding personal and individualistic opinions to the situation.

#Resist

In light of the recent presidential election, I chose to explore the hashtag resist (#resist), which has been a way for activists to get connected on Twitter. It has been a flashpoint – people have made statements via the hashtag and have heard about events through it as well (such as the women’s march/march for science). Using the tool we discussed in class – 6.1 I reran the experiment at home to get a snapshot of tweets that included the hashtag resist (#resist).

The data I selected to explore the question was the hashtag resist (#resist). After reading the article for last class, I got a lot more interested in the use of Twitter for activism – real activism as opposed to slacktivism (being a keyboard warrior). I chose to explore this and, using the TAGS tool, I saw that people were taking this very seriously – either making their objections to the administration well known or making statements – it seemed this hashtag is akin to the pink hat worn at the Women’s March.

I decided against applying too many filters because I wanted to see what people thought. The one that I considered the longest was the subscriber count – perhaps raising that would cut the number of tweets but ultimately I decided not to. The biggest reason was I wanted to see what everyone was talking about – including those who made accounts just to be activists.

The relationship was interesting in that a majority of tweets from one time were almost laserlike in their focus on certain missteps by the administration at any given point. In my opinion, if this experiment had been run a week ago or is run again in the future, the entire thing should and probably will change. The other trend is that it’s all focused towards the presidency, which is a mistake in my opinion. It needs to be more focused towards all levels of government.

This screenshot of the data shows the rapidity of the tweets and how much the hashtag was used (Figure 1).

This experiment went as I’d expect. However, even though I consider myself relatively well plugged in, I was surprised at the rate at which things moved and were tweeted.

The findings of this experiment verify the arguments of the readings – as far as activism is concerned, social media is a large part of that. I’ll admit being surprised at this, but I guess that Twitter really is the new form of activism.