Text mining scripts in Python
I wrote a set of Python scripts to run numerical analysis on my writing. You can gain valuable insight into your writing by measuring your words. Measuring your writing can be part of your writing practice, and it sets the basis for creating tests to validate hypotheses about your writing.
You can find the text mining scripts in my GitHub repository. The repo contains the following scripts:
- Concordance (word count)
- N-Gram (word pair count)
- Entity extraction (keyword count)
- Search engine keyword analysis (scores keywords)
- Parts-of-Speech (part-of-speech frequency)
- Reading level (reading grade level)
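To give a sense of what the first two scripts in the list measure, here is a minimal sketch of a word count and a word pair count. It is not the code from the repository: it uses only the Python standard library rather than NLTK, and the function names (tokenize, concordance, ngram_counts) are hypothetical names chosen for this illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its words (runs of letters and straight apostrophes).

    Assumption: this crude tokenizer drops digits and punctuation and does not
    handle curly apostrophes; a real script would likely use an NLTK tokenizer.
    """
    return re.findall(r"[a-z']+", text.lower())


def concordance(text):
    """Count how many times each word appears in the text."""
    return Counter(tokenize(text))


def ngram_counts(text, n=2):
    """Count each run of n consecutive words (n=2 gives word pairs).

    Note: sentence boundaries are ignored, so a pair can span two sentences.
    """
    words = tokenize(text)
    # zip over n shifted copies of the word list to form the n-grams
    return Counter(zip(*(words[i:] for i in range(n))))


sample = "I write every day. I count the words I write every day."
word_counts = concordance(sample)
pair_counts = ngram_counts(sample)

print(word_counts.most_common(3))
print(pair_counts.most_common(3))
```

Both functions return a Counter, so most_common() gives the ranked list you would read as a concordance or a word pair table.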
These scripts were partly inspired by a job I had as a social media analyst in 2008, and the desire to create them is one of the reasons I started learning Python. Social media analysis mixes two different disciplines: social network analysis and text mining. Social network analysis is the kind of thing that epidemiologists use to find patient zero. For my job, I was using network maps to track who was talking about products and who was listening to them talk about products. For example, I looked at who was talking about salsa. To quickly assess what they were talking about, I used some crude forms of text mining. My methods at the time typically involved Excel.
These scripts use the Natural Language Toolkit (NLTK), a Python library that has been around since 2001.
I have always counted the words in my writing or, in a handwritten journal, the time.
When I was sixteen years old and facing what seemed an essential choice about what I wanted to do with my life, the choice seemed to be about more than a way of making money. It had to be a vocation. In my sophomore and junior years in high school, the school gave us a military skills test and career tests. I think both tests determined I should be a clerk.
I had planned on going to school to become an electrical engineer. In my family at the time, the only white-collar jobs anyone had ever had were as engineers. I went to a school where some of the parents were engineers for Boeing and some of the parents worked in the factories assembling airplanes. My grandfather had worked in a factory working on airplanes. My other grandfather worked on nuclear submarines. My father and uncles, however, had driven cabs and worked in restaurants. They weren’t chefs, but worked in the kitchen. And the thing about restaurant work is that no one who worked in a Seattle diner preparing Crab Louie thought of themselves as a chef. They thought of themselves as people who lived their lives and paid the bills by working in a Seattle diner preparing Crab Louies.
The engineers, however, identified themselves as engineers, as if being an engineer was an existential condition. In the spring of sophomore year, contemplating a profession where I would be an engineer rather than a guy who works on engineering to pay the bills, I decided that if that was the deal, I would rather be a writer.
I had no idea what this meant. Since engineers could earn a living in the state of being an engineer, couldn’t writers get by in a state of being a writer?
I didn’t really realize at the time that being a writer was more like being a clerk, and I was just affirming the effectiveness of the vocational tests.
I had a vague notion that a writer was someone who wrote and somehow things like housing and food were not that big of a deal. They just came with the gig. I may have been basing this on Jack Torrance from The Shining or Garp from The World According to Garp. I had three ideas that ended up being helpful to me.
One idea was that I had to write every day and that I was beginning from nothing and would have to learn by writing if I wanted to be a writer.
The second idea was that I had to finish stories. Garp wrote a story a month while in high school. And I had read in the introduction to a collection of Ursula K. Le Guin’s first stories, The Wind’s Twelve Quarters, that she had written stories as a regular practice and sent them out. She wrote 40 stories before she was published.
The third idea was that I had to send my finished stories to magazines to get published. This meant I had to find out where these things were and who to send them to.
Every night beginning that spring in 1987, I sat down to write. I thought about my work like it was a homework assignment, and so learned I could write 500 words with some degree of concentration. Even 500 words a night meant that in a week I had a number of words that indicated the length of a story.
In a year I had finished 10 stories. In two years I had finished about 20 stories and written what I thought was a novel which was about twenty thousand words. One of the surprising aspects of this habit was that it didn’t require that much time. I could write 500 words in less than an hour. After I learned to type, I could write that in less than half an hour.
I kept writing when I enlisted in the Army Reserve and went through Boot Camp and skills training at Fort Sam Houston. I didn’t have time to write in Boot Camp, but at Fort Sam, the base library had typewriters I could use, and I bought onion skin typing paper and typed on the IBM Selectric typewriter they had, 500 words or more. And then I learned that onion skin is not good to type on because the ink flakes off. So I retyped my stories.
Word count was a familiar metric to me. It was like miles are to a long-distance runner, or laps to a swimmer.
I thought of the count of finished stories as proof that I was progressing toward being a writer, thinking there would be a state change at some point. I would be published, and thereafter be a writer.
I also sent stories out. I found The Writer’s Digest Writer’s Market. I learned to send a story with return postage, and began to collect rejection slips. I expected rejection slips at first and then became used to them. Early on in 1988 I got a shock when I sent a story to a local magazine edited by a writer I had read about in The Seattle PI, Jessica Amanda Salmonson. She sent me back a letter and said something to the effect that, based on the strength of the title of my story, she had read my entire manuscript. And this had been her mistake. She suggested I get serious psychological help as quickly as possible, not only for my own safety but for the safety of everyone around me. My story had been called, “Leave Shatter Like Skulls.” I was reading a lot of L. Sprague de Camp, Robert Howard, Michael Moorcock, and H. P. Lovecraft at the time. I was thrilled that my story had been read and that it had struck a nerve.
Based on this letter that saw my writing not as writing but as the symptom of a deranged mind, I kept at it for years. As a college freshman in 1992 I won a prize but not publication in STORY magazine, in a year that saw a writer named Benjamin Anastas from the University of Iowa MFA program win the first prize. The next year I published a story in the Bellevue College magazine Arnazella.
It was working, in that becoming a writer turned out to be less of a state change from not being a writer to being one. Being a writer was more like being a runner. While actively running, I am a runner. While putting in a regular word count, I am a writer. Jack Torrance is probably a good model. I think most people may think of Torrance as a proxy for Stephen King. And in terms of making a living as a writer, who wouldn’t want to be Stephen King? But in fact it is much more like Torrance putting in words in the lobby of the Overlook Hotel. Just be nice to your family and don’t go to Room 237.