Colloquium Report(Swati Agarwal 2011CA07) 2

Motilal Nehru National Institute of Technology

Allahabad

Colloquium 2013

MCA-V semester

Title: Mining for Norms in Clouds: Complying to Ethical Communication through Cloud Text Data Mining

Authors: Ahsan Nabi Khan, Aslam Muhammad, Martinez Enriquez A. M.

Submitted By
Name: Swati Agarwal
Registration No.: 2011CA407

Contents

1. Introduction

2. Objectives of the Application

3. Methodologies
3.1 Anti-Sovereign/Anarchist
3.2 Theft
3.3 Killing
3.4 Immorality
3.5 Blasphemy
3.6 Contraband
3.7 Judicial

4. Norms Classification and Clustering Results

5. Conclusion

6. References

1. Introduction

Several thousand years ago, according to written tradition, seven ethical laws of the world were laid down, symbolized by the rainbow. These laws have formed the norms of society in every age, then and now. With the advent of newer technologies, legality and ethical norms go hand in hand. The Internet is one of the newer spaces subject to the ethical norms of society and of the world in general. Controlling and regulating the Internet giants has become a formidable task for most governments in recent years, and the technology moves faster than it can be tamed.

As the world realizes the power and efficiency of cloud computing, enhanced security and intelligence are needed in communication to filter out unethical data that violates norms in clouds. Numerous lists of banned, unethical and objectionable words have been developed, with limited user satisfaction. These lists are usually generated manually, with some programmable extensibility for online forums and public newsgroups.

Clouds currently face the threat of weakened security and legal violations. The Internet has recently had to address anti-state website content, phishing, fraud, e-crimes, objectionable taboo material and other legally regulated content.

This report describes a tool and methodology for categorizing data subject to censorship. Word lists grow statistically within the categorized data, and hidden neutral words are tagged with the meaning they carry in context. Using computational linguistics tools, modified to suit our purpose, sample text is analyzed from gigabytes of email newsgroup data held on cloud servers. A sample result dataset of the words that most frequently break the norms in recent cloud communication is presented in the results, grouped into broad categories. The categories separate cloud-server data found in newsgroups related to Internet crimes, fraud, theft, anti-state elements, and other material of legal importance. The report thus demonstrates a tag cloud of the most frequent critical words in communications, from a legal and ethical point of view, in the current landscape of cloud databases.

2. Objectives of the Application

1. To address the problem of managing ethical norms and controlling communications that use cloud databases.

2. To implement ethical constraints on cloud data stores, including need-based social security constraints. This work may be taken as a stepping stone toward a larger model for a global secure computing environment free from ethical and security issues.

3. To demonstrate and evaluate how the semantics of unstructured data on the cloud can be extracted, filtered, displayed as a tag cloud, and stored in datasets for further preprocessing and text analytics.

3. Methodologies

1. We make use of the available lists of banned words developed by various online and governmental bodies; some are open online lists and others were leaked. We use them as seed sets to learn models that can classify the data into categories and predict whether it complies with norms. Seed sets form tagsets, each tagset containing the words of one particular category.

2. Finally, a survey of the communication data from the cloud displays the trends in the most frequently used words representing topics, and the most frequently tagged words, based on the significance of the norm they comply or fail to comply with.

3. We create a language processing tool that builds a preprocessed dataset from unstructured text. For this we customize open-source CodeProject source code for creating tag clouds from Internet and local text resources.
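The preprocessing and tag-cloud step can be pictured with a minimal Python sketch. It is only an illustration, not the customized CodeProject tool: the stop-word list, the function names `tokenize` and `tag_cloud_weights`, and the font-size range are all assumptions introduced here.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real tool would use a much larger one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "on", "for"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tag_cloud_weights(text, top_n=10, min_size=10, max_size=48):
    """Return (word, count, font_size) for the top_n most frequent non-stop words."""
    counts = Counter(t for t in tokenize(text) if t not in STOP_WORDS)
    top = counts.most_common(top_n)
    if not top:
        return []
    highest = top[0][1]
    lowest = top[-1][1]
    span = max(highest - lowest, 1)  # avoid division by zero when all counts are equal
    # Scale each count linearly into the font-size range (assumed range, in points).
    return [
        (word, count, min_size + (count - lowest) * (max_size - min_size) // span)
        for word, count in top
    ]

sample = "Free speech and free press: the press wrote that people want free access."
for word, count, size in tag_cloud_weights(sample, top_n=3):
    print(word, count, size)
```

The most frequent word receives the largest font size, which is the basic idea behind a Wordle-style display; any layout or rendering is left out of the sketch.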

The project is inspired by the cloud generator Wordle (http://www.wordle.net). Figure 1 shows how data collected from USENET communications over the Amazon cloud appears in a customized Wordle cloud.

Figure 1: USENET communication over the cloud created as tag cloud in Wordle

Having built the capability to learn weighted words from seven overlapping or mutually exclusive tagsets using newsgroup data over the clouds, we can now select initial seed sets for the seven tagsets. We consider the following points in selecting the seven tagsets for implementing secure cloud communication:

3.1 Anti-Sovereign/Anarchist: For public, private and hybrid clouds in various regions across the globe, a standard state-compliance policy should be adopted for state sovereignty, so that state machinery can be coordinated in practice with local and global cloud environments. Matters related to such state policy should be traceable and available in the data storage. Such state-governed systems can smooth the coordinated functioning of governments. Security threats such as unexpected WikiLeaks-style disclosures and rumours of war can thereby be moderated.

3.2 Theft: We have to test the detection of phishing, identity theft, intellectual property violation, and other related commercial Internet crimes throughout the communication of emails, memos, etc. within SaaS. Since companies are already investing one-third of their IT-related endeavors in cloud computing, financial crimes can aggravate the overall cost of employing cloud infrastructure, platforms or software. The top theft threats are theft of intellectual and other property, and misuse. Any organization, as minor as a school or as major as a strategic capability, should be safe from those seeking access to untapped sources of free money.

3.3 Killing: We aim to prevent acts of terrorism and anything harmful to human life by continuously monitoring global communications over instant messaging services and other rapid, private communication platforms. This provides safety and security as well as the eyes of a watchdog. Data held by Google and Microsoft is to be combed in cross-platform, cross-language efforts to detect any cross-border communication in local languages that indicates subversive activities. Catching terrorists through their emails has recently been put into practice in the developed world, but it is much needed in the underdeveloped and developing world, where this is a major life-and-death concern.

3.4 Immorality: We use modest means to devise a safe environment for families and children using cloud technologies. Cloud-enabled filters and safe search from the popular engines can be extended into cut-and-dried solutions for schools, universities, colleges, etc. Access to videos and materials is made easy by social Internet communities like Facebook.com and Youtube.com, which may be stakeholders in such coordinated SaaS environments. Filtering on the labels of such videos is made possible by classification with parental safe and unsafe word lists.

3.5 Blasphemy: We coordinate with state laws for the regulation of information and limits on the freedom of expression. Most Muslim-majority states have laws against anti-state and anti-religious expression, which require state authorities such as the Pakistan Telecommunication Authority and the Lahore High Court to detect and ban objectionable terms in communications and the media. Using intelligent clustering techniques similar to spam-filter technologies, the cloud may coordinate further to support custom-made rules for each cloud culture. Asian societies, because of their rich cultural and ethnic diversity, need to customize such category words further to suit the lore and mores of their specific regions.

3.6 Contraband: We have to keep the cloud from being used to sell, buy or deal in illegal consumption items. These include, but are not limited to, contraband drugs and radioactive and other harmful materials. Communication that lists such items is to be detected and flagged with an alert.

3.7 Judicial: To enable communication that meets legal and ethical standards, we use international law to implement the universally agreed seven ethical laws and to ensure the coordination of cloud-enabled technologies for that purpose. All six problems above can be addressed by first implementing filters on natural language that contains hints of the above and related terms and concepts. Semantic tags over the Web would be needed to classify information into these categories.
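The seven tagsets and the natural-language filter built on them can be sketched as follows. This is a minimal illustration under stated assumptions: the category keys follow sections 3.1 to 3.7, but the seed words are placeholders rather than the lists used in the study, and the names `SEED_TAGSETS`, `tag_message` and `violates_norms` are hypothetical. Treating Judicial as the compliant category is an interpretation of the text above.

```python
import re

# Hypothetical seed tagsets: the category names follow sections 3.1-3.7,
# but the words are placeholders, not the lists used in the original study.
SEED_TAGSETS = {
    "anti_sovereign": {"anarchy", "overthrow", "revolt"},
    "theft": {"phishing", "steal", "piracy"},
    "killing": {"bomb", "assassinate", "kill"},
    "immorality": {"obscene", "explicit"},
    "blasphemy": {"blaspheme", "sacrilege"},
    "contraband": {"narcotics", "smuggle"},
    "judicial": {"court", "law", "verdict"},
}

def tag_message(text, tagsets=SEED_TAGSETS):
    """Return {category: sorted list of matched seed words} for one message."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    hits = {}
    for category, words in tagsets.items():
        matched = tokens & words
        if matched:
            hits[category] = sorted(matched)
    return hits

def violates_norms(text, tagsets=SEED_TAGSETS):
    """Flag a message if it matches any category other than 'judicial' (assumed compliant)."""
    return any(category != "judicial" for category in tag_message(text, tagsets))

print(tag_message("They plan to smuggle narcotics past the court."))
print(violates_norms("The court issued a verdict."))
```

A plain word match like this is only the starting point; it cannot tell context apart, which is why the report goes on to learn additional words statistically.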

4. Norms Classification and Clustering Results

1. Running five years of USENET newsgroup textual data from the Amazon cloud dataset through our modified TagCloud CodeProject tool, we see the following spread of words:

Figure 2: Wordle Tag Cloud by occurrences without sense filter on sample data from March 2011

As the list of words shows, the most frequently used words were Email Address, Apr, Usenet, Wrote, people and others, which are neutral and do not reveal the objectionable content.

2. After training on the same text in our modified TagCloud CodeProject tool, we see more relevant words, almost all tagged in the first category, Anti-State Anarchy. Surprisingly, our much smaller Killing tagset also tagged a frequently used word, floor, which happened to co-occur with words of that tagset, and it tagged it correctly.

Figure 3: TagCloud results on sample data from March, 2011

3. When we give the training a second round (see Figure 4), we cluster more words related to the same category, Anarchist, and finally arrive at a main super tagword, “Free”, with numerous learned words supporting it. We also see that the seventh and last tagset category, Judicial, is detected in the newsgroup cluster, but only on the outer border of the main Anarchist cluster. This means the learning naturally and correctly formed one main category of words related to Anarchy and a borderline minority of words related to the Judicial tagset.

Figure 4: TagCloud results on sample data from March 2011 (2nd iteration)
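The way a neutral word such as floor joins a tagset through co-occurrence, and the way a second round pulls in further words, can be illustrated with a small Python sketch. It is a deliberate simplification: a plain co-occurrence count stands in for the statistical model of the actual tool, and the function name `expand_tagset`, the threshold, and the sample messages are all assumptions made for illustration.

```python
import re
from collections import Counter

def expand_tagset(messages, tagset, min_cooccurrence=2, stop_words=frozenset()):
    """One training round: add words that co-occur with tagset words
    in at least min_cooccurrence messages."""
    cooccur = Counter()
    for text in messages:
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        if tokens & tagset:  # the message contains at least one tagged word
            cooccur.update(tokens - tagset - stop_words)
    learned = {word for word, n in cooccur.items() if n >= min_cooccurrence}
    return tagset | learned

# Invented sample messages, not taken from the USENET dataset.
messages = [
    "kill the lights on the floor",
    "a bomb threat cleared the floor",
    "the meeting is on the second floor",
    "second floor meeting moved",
]
stop = frozenset({"the", "a", "on", "is"})

round1 = expand_tagset(messages, {"kill", "bomb"}, stop_words=stop)
round2 = expand_tagset(messages, round1, stop_words=stop)
print(sorted(round1))
print(sorted(round2))
```

In this toy data the second round also absorbs neutral words that merely sit next to floor, which shows why each round of learned words needs to be checked against the category it is meant to represent.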

5. Conclusion

As the world is already concerned with filtering content, with some resourceful countries already using tools for the purpose and other countries lagging behind in responsibly sharing the cloud-enabled World Wide Web, the words in use hold the key. With words we can sense expression; similarly, with words we can sustain expression. Words hold the key to containing anti-state activity, theft, killing, immorality, blasphemy, contraband consumption, and similar extra-judicial activities.

What we see on the USENET cloud data servers is traces of unearthed hate material and exchanges of anarchist views up to the last observation in March 2011. Even though most tag clouds will not have the power to do semantic tagging of worldwide cases as such, the viability of using text data mining tools such as the one demonstrated above exists. Each semantic category may learn its own relevant words through our computational linguistics model, which uses a stemmer and a Hidden Markov Model sensor, and produce datasets of key-value pairs to be used and reused for cloud clusters. Such an implementation over platforms and software offered as services in the clouds can make it possible to learn and filter content, and to grow ethically the norms of a text data mining cloud server platform.
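As a rough illustration of the stemming and key-value output mentioned above, the following Python sketch reduces words to stems and emits stem-to-count pairs. The suffix list is a naive stand-in chosen for this example, not the stemmer used in the tool, the Hidden Markov Model stage is omitted entirely, and the names `naive_stem` and `stem_counts` are hypothetical.

```python
import json
from collections import Counter

SUFFIXES = ("ing", "ed", "es", "s")  # naive suffix list, not the Porter algorithm

def naive_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_counts(words):
    """Map each stem to the number of input words that reduce to it."""
    return dict(Counter(naive_stem(w.lower()) for w in words))

pairs = stem_counts(["Hack", "hacked", "hacking", "hacks", "law", "laws"])
print(json.dumps(pairs, sort_keys=True))
```

Grouping inflected forms under one key keeps the stored dataset compact and lets a category's weight accumulate on a single stem.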

Governments can take up, and some already have taken up, the challenge of putting reins on this apocalyptic horse for the smoother functioning of political, financial, social and international machinery. Removing the six categories of words ensures that the seventh category, judicially sanctioned words, is maintained across the World Wide Web in the age of the clouds.

6. References

1. Ahsan Nabi Khan, Aslam Muhammad, Martinez Enriquez A. M. "Mining for Norms in Clouds: Complying to Ethical Communication through Cloud Text Data Mining." IEEE, 2012.

2. Yakov Dovid Cohen. "Divine Image": Insights into the Laws of Noah. The Institute of Noahide Code, 2006. ISBN 1-4243-10008. Online: www.Noahide.org

3. Ian Foster and Carl Kesselman. The Grid 2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco, California, 2004.

4. Hillol Kargupta. Proceedings of Next Generation Data Mining 2007. Taylor and Francis, 2008.

5. Lawrence Lessig. Code and Other Laws of Cyberspace. Basic Books, New York, 1999, 2006.