classification

These are the stories that have been posted to the classification category.

How to determine which language(s) were used to build a .NET assembly


Published to Rick Minerich's Development Wonderland by Richard Minerich February 26, 2009 11:00

While in most cases there is no explicit information in an assembly as to which languages it was compiled from, it is possible to make an educated guess.  This is because each .NET compiler leaves its own unique fingerprint.  In this article I discuss both my methodology for finding these fingerprints and which fingerprints were unique to each language I tested.

 

Methodology

For each language I made a new class library project.  I then reflected and compared each assembly to determine which unique characteristics it had.  It turned out that, at least for C#, F#, VB and C++, each was uniquely identifiable by the existence, or lack thereof, of certain features.

To break it down a bit: in each project I added one class containing one public method:

public class CSharpClass
{
    public void LocalMethod() {}
}

After compiling each of these projects into its own assembly, I referenced them from another testing project.  To grab a set of features for each language, I used the following three reflection calls:

Assembly.GetTypes()
Assembly.GetCustomAttributes()
Module.GetFields( BindingFlags.NonPublic | BindingFlags.Static )

Then, with a simple program, I found which of these features were unique for each language.  This set of unique features ultimately represents a map of the imprint each compiler leaves.
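The original "simple program" is not shown, so the following is only a minimal sketch of how the three reflection calls and the uniqueness check might fit together.  The class name FeatureCollector, the string prefixes and the "Language=path" command line format are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

public static class FeatureCollector
{
    // Gathers the results of the three reflection calls as prefixed strings.
    public static HashSet<string> Collect(Assembly assembly)
    {
        var features = new HashSet<string>();
        foreach (Type type in assembly.GetTypes())
            features.Add("type:" + type.FullName);
        foreach (object attribute in assembly.GetCustomAttributes(false))
            features.Add("attribute:" + attribute.GetType().FullName);
        foreach (Module module in assembly.GetModules())
            foreach (FieldInfo field in module.GetFields(BindingFlags.NonPublic | BindingFlags.Static))
                features.Add("modulefield:" + field.Name);
        return features;
    }

    // Features seen for one language and for no other language in the map.
    public static HashSet<string> UniqueTo(string language, IDictionary<string, HashSet<string>> byLanguage)
    {
        var unique = new HashSet<string>(byLanguage[language]);
        foreach (var other in byLanguage.Where(pair => pair.Key != language))
            unique.ExceptWith(other.Value);
        return unique;
    }

    public static void Main(string[] args)
    {
        // Hypothetical usage: FeatureCollector.exe "C#=CSharpLib.dll" "VB=VbLib.dll"
        var byLanguage = new Dictionary<string, HashSet<string>>();
        foreach (string arg in args)
        {
            string[] parts = arg.Split(new[] { '=' }, 2);
            if (parts.Length != 2) continue; // skip malformed arguments
            byLanguage[parts[0]] = Collect(Assembly.LoadFrom(parts[1]));
        }
        foreach (string language in byLanguage.Keys)
            foreach (string feature in UniqueTo(language, byLanguage).OrderBy(f => f))
                Console.WriteLine(language + "\t" + feature);
    }
}
```

Because Assembly.GetCustomAttributes instantiates each attribute, the assemblies that define those attributes need to be loadable for this sketch to run.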

 

F#

A compiled F# library will only have one attribute by default:

Microsoft.FSharp.Core.FSharpInterfaceDataVersionAttribute

This made it the easiest to differentiate of all the languages I tested.  Even more interesting, this attribute contains three fields which together give the version of the F# compiler used to generate the assembly:

Field      Value    Type
Major      1        int
Minor      9        int
Release    6        int

I’m always impressed with how the F# team consistently goes above and beyond when it comes to the small details.
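As a rough sketch, the version numbers can be read without referencing the F# core library by inspecting the attribute's metadata.  The class name FSharpVersionProbe is hypothetical, and the code assumes the attribute's three constructor arguments are the major, minor and release numbers in that order.

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;

public static class FSharpVersionProbe
{
    const string AttributeName = "Microsoft.FSharp.Core.FSharpInterfaceDataVersionAttribute";

    // Returns "major.minor.release", or null when the attribute is absent.
    public static string TryGetInterfaceDataVersion(Assembly assembly)
    {
        foreach (CustomAttributeData data in CustomAttributeData.GetCustomAttributes(assembly))
        {
            if (data.Constructor.DeclaringType.FullName != AttributeName)
                continue;

            // Assumption: the constructor arguments are major, minor, release.
            var parts = new List<string>();
            foreach (CustomAttributeTypedArgument argument in data.ConstructorArguments)
                parts.Add(Convert.ToString(argument.Value));
            return string.Join(".", parts);
        }
        return null;
    }

    public static void Main(string[] args)
    {
        // With no argument, inspect this program's own assembly.
        Assembly assembly = args.Length > 0
            ? Assembly.LoadFrom(args[0])
            : typeof(FSharpVersionProbe).Assembly;
        Console.WriteLine(TryGetInterfaceDataVersion(assembly) ?? "attribute not found");
    }
}
```

Reading CustomAttributeData rather than calling GetCustomAttributes avoids instantiating the attribute, so the F# core library does not have to be loaded as an attribute instance.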

 

Visual Basic

The Visual Basic assembly I generated was also easily identifiable via extra types which were automatically added:

My.MyApplication
My.MyComputer
My.MyProject
My.MyProject+MyWebServices
My.MyProject+ThreadSafeObjectProvider`1
My.Resources.Resource
My.MySettings
My.MySettingsProperty

As you can see from this list, the existence of these types in the “My” namespace is a fairly safe indicator that the Visual Basic language was used. 

 

C++/CLI / Managed C++

C++/CLI and Managed C++ are considered to be the same language with slightly different syntax as they share the same compiler.  However, there are four different compilation modes for C++ and each has somewhat different results.

  • /CLR – Common Language Runtime Support
  • /CLR:pure – Pure Common Language Runtime Support
  • /CLR:safe – Safe Common Language Runtime Support
  • /CLR:OldSyntax – Managed C++ Syntax

The /CLR, /CLR:pure and /CLR:OldSyntax settings produce assemblies that are easy to classify, as they all inject an enormous number of types (70+) into the assembly.  I verified that each contained two types from the vc_attributes namespace:

vc_attributes.YesNoMaybe
vc_attributes.AccessType

However, /CLR:safe is quite different in that it injects no types and adds no assembly attributes by default.  The generated assembly was almost completely clean.  I was forced to use Reflector to determine how to differentiate it from C#.

 

C#

C# was one of the most difficult assembly types to identify.  This is because it has no unique types and only one unique attribute:

System.Reflection.AssemblyConfigurationAttribute

Unfortunately, this attribute is set in the project’s AssemblyInfo.cs file, where it can be edited or removed, and so we can’t depend on it.  Up to this point it was only necessary to use two reflection calls:

Assembly.GetTypes()
Assembly.GetCustomAttributes()

I was hoping to keep things very simple.  However, to differentiate C# from C++ compiled with /CLR:safe it’s necessary to go a bit further.  It turns out that C++ always injects a module-level field into the assembly while C# does not.  And so by using:

Module.GetFields(BindingFlags.NonPublic | BindingFlags.Static)

we can check for the existence of this kind of field and so differentiate these two languages.
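Putting the checks from each section together, a hard-coded hierarchy might look something like the sketch below.  The class name LanguageGuesser, the check order and the returned labels are assumptions rather than the original code, and the inputs are expected to come from the three reflection calls described earlier.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LanguageGuesser
{
    // typeNames:        full names from Assembly.GetTypes()
    // attributeNames:   full type names from Assembly.GetCustomAttributes()
    // moduleFieldCount: count from Module.GetFields(NonPublic | Static)
    public static string Guess(IEnumerable<string> typeNames,
                               IEnumerable<string> attributeNames,
                               int moduleFieldCount)
    {
        if (attributeNames.Contains("Microsoft.FSharp.Core.FSharpInterfaceDataVersionAttribute"))
            return "F#";

        // Assumption: the "My" namespace may sit under a project root namespace.
        if (typeNames.Any(name => name.StartsWith("My.", StringComparison.Ordinal)
                               || name.Contains(".My.")))
            return "Visual Basic";

        if (typeNames.Any(name => name.StartsWith("vc_attributes.", StringComparison.Ordinal)))
            return "C++";

        // No injected types, but a module-level field is still present.
        if (moduleFieldCount > 0)
            return "C++ (/CLR:safe)";

        // Nothing distinctive was found, so C# is a guess by elimination.
        return "C# (by elimination)";
    }

    public static void Main()
    {
        var none = new string[0];
        Console.WriteLine(Guess(new[] { "My.MyApplication", "Lib.VbClass" }, none, 0));
        Console.WriteLine(Guess(new[] { "Lib.CSharpClass" }, none, 0));
    }
}
```

Note that the namespace check is fragile: a C# project that happens to use a namespace called "My" would be misclassified as Visual Basic.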

After some investigation with Reflector, I was able to find one particular feature unique to C#.  Unfortunately, it requires disassembling methods and looking at the resulting IL.  It seems as though a C# method definition never has a .maxstack of less than 8.  In all the other languages I observed .maxstack set to values as low as 0 for an empty method.

However, as I am currently concerned only with these four languages, my testing on this matter has been very shallow, so please take it with a grain of salt.
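For anyone who wants to test the .maxstack observation more widely, the value is also exposed through reflection as MethodBody.MaxStackSize, so a disassembler is not strictly needed.  The following is a minimal sketch; the class name MaxStackProbe is hypothetical, and it only looks at methods, not constructors.

```csharp
using System;
using System.Reflection;

public static class MaxStackProbe
{
    const BindingFlags AllDeclared =
        BindingFlags.Public | BindingFlags.NonPublic |
        BindingFlags.Instance | BindingFlags.Static | BindingFlags.DeclaredOnly;

    // Returns the smallest .maxstack among methods that have an IL body,
    // or null when no such method exists.
    public static int? MinMaxStack(Assembly assembly)
    {
        int? minimum = null;
        foreach (Type type in assembly.GetTypes())
        {
            foreach (MethodInfo method in type.GetMethods(AllDeclared))
            {
                MethodBody body = method.GetMethodBody();
                if (body == null)
                    continue; // abstract, extern or otherwise body-less

                if (minimum == null || body.MaxStackSize < minimum)
                    minimum = body.MaxStackSize;
            }
        }
        return minimum;
    }

    public static void Main(string[] args)
    {
        // With no argument, inspect this program's own assembly.
        Assembly assembly = args.Length > 0
            ? Assembly.LoadFrom(args[0])
            : typeof(MaxStackProbe).Assembly;
        int? minimum = MinMaxStack(assembly);
        Console.WriteLine(minimum.HasValue ? minimum.Value.ToString() : "no method bodies found");
    }
}
```

The observation above was made on empty methods.  C# methods that declare local variables commonly report smaller values such as 1 or 2, so a low minimum does not by itself rule out C#.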

 

Conclusion

I admit that my sample assembly set was very small and my feature set very large.  However, while this type of classification may not be robust enough for a system which depends on the results being absolutely correct, I’ve shown that it is entirely possible to make reasonably confident guesses as to the language used to generate a .NET assembly using only simple reflection.  It would be interesting to see how well this holds for obfuscated assemblies as well as other “bare minimum” compilations generated via different combinations of compiler settings.

The next obvious step would be to extend what I have already written into a full Bayesian classifier.  This would be much better than a hardcoded hierarchy, which would be fragile and possibly completely and repeatedly incorrect for some cases.  Another big advantage of using machine learning here is that it would be easy to add new features and classification categories.
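To give a feel for what such a classifier could look like, here is a small Bernoulli naive Bayes sketch over present/absent features with add-one smoothing.  It is an illustration only, not the classifier proposed above: the class name, the feature names and the handful of training examples in Main are all hypothetical, loosely following the fingerprints described in this article.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class NaiveBayesLanguageClassifier
{
    private readonly string[] featureNames;
    // label -> number of training examples seen
    private readonly Dictionary<string, int> labelCounts = new Dictionary<string, int>();
    // label -> feature -> number of examples in which the feature was present
    private readonly Dictionary<string, Dictionary<string, int>> presentCounts =
        new Dictionary<string, Dictionary<string, int>>();
    private int total;

    public NaiveBayesLanguageClassifier(IEnumerable<string> featureNames)
    {
        this.featureNames = featureNames.Distinct().ToArray();
    }

    public void Train(string label, ISet<string> presentFeatures)
    {
        if (!labelCounts.ContainsKey(label))
        {
            labelCounts[label] = 0;
            presentCounts[label] = featureNames.ToDictionary(f => f, f => 0);
        }
        labelCounts[label]++;
        total++;
        foreach (string feature in featureNames)
            if (presentFeatures.Contains(feature))
                presentCounts[label][feature]++;
    }

    // Returns the label with the highest log posterior, or null if untrained.
    public string Classify(ISet<string> presentFeatures)
    {
        string best = null;
        double bestScore = double.NegativeInfinity;
        foreach (string label in labelCounts.Keys)
        {
            double score = Math.Log((double)labelCounts[label] / total);
            foreach (string feature in featureNames)
            {
                // Add-one (Laplace) smoothing keeps probabilities away from 0 and 1.
                double p = (presentCounts[label][feature] + 1.0) / (labelCounts[label] + 2.0);
                score += Math.Log(presentFeatures.Contains(feature) ? p : 1.0 - p);
            }
            if (score > bestScore)
            {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void Main()
    {
        // Hypothetical feature names and training data, for illustration only.
        var classifier = new NaiveBayesLanguageClassifier(
            new[] { "fsharp-attribute", "my-namespace", "vc_attributes", "module-field" });
        classifier.Train("F#", new HashSet<string> { "fsharp-attribute" });
        classifier.Train("VB", new HashSet<string> { "my-namespace" });
        classifier.Train("C++", new HashSet<string> { "vc_attributes", "module-field" });
        classifier.Train("C++", new HashSet<string> { "module-field" });
        classifier.Train("C#", new HashSet<string>());

        Console.WriteLine(classifier.Classify(new HashSet<string> { "module-field" }));
    }
}
```

Adding a new language or feature then only means adding training examples or feature names, rather than rewriting a chain of if statements.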

A Short History of Programming Languages


Published to Rick Minerich's Development Wonderland by Richard Minerich June 11, 2009 17:42

Recently, I was reading David R. Tribble’s annotated version of Dijkstra’s famous letter “Go To Statement Considered Harmful”.  While reading, it occurred to me that I did not really understand the history of language abstraction.  To remedy this I’ve done some research and put together the following post.  I hope you find it as educational to read as I found it to write.

Programming languages are often spoken of in terms of their level of abstraction.  To this end there is a somewhat official classification system.  In said system, each generation in the hierarchy represents another level of abstraction away from the machine hardware. 

 

First-generation programming language (1GL) – Binary

I think there is a world market for maybe five computers.
-Thomas Watson

It makes sense Watson would say this seeing as how the earliest computers were programmed entirely in binary.  These computers were programmed with no abstraction at all.  I, for one, do not envy our forefathers in regard to this task.  While the programs were small, all operations, data and memory had to be managed by hand in binary.

  • Introduced in the 1940s
  • Instructions/Data entered directly in binary
  • Memory must be manually moved around
  • Very difficult to edit/debug
  • Simple programs only

Examples:

Architecture specific binary delivered on Switches, Patch Panels and/or Tape.

 

Second-generation programming language (2GL) – Assembly

He who hasn't hacked assembly language as a youth has no heart. He who does as an adult has no brain.
-John Moore

Assembly languages were introduced to mitigate the error-prone and excessively difficult nature of binary programming.  While still used today for embedded systems and optimization, they have mostly been supplanted by third-generation languages, largely because of how difficult it is to control program flow in assembly.

  • Introduced in the 1950s
  • Written by a programmer in an intermediate instruction language which is later assembled into binary instructions
  • Specific to platform architecture
  • Designed to support logical structure, debugging
  • Defined by three language elements: Opcodes (CPU Instructions), Data Sections (Variable Definitions) and Directives (Macros)

Examples: 

Almost every CPU architecture has a companion assembly language.  Most commonly in use today are the assembly languages for RISC and CISC processors, x86 among them, as those are what our embedded systems and desktop computers use.

 

Third-generation programming language (3GL) – Modern

“Real programmers can write assembly code in any language.”
-Larry Wall

Third generation languages are the primary languages used in general purpose programming today.  They each vary quite widely in terms of their particular abstractions and syntax.  However, they all share great enhancements in logical structure over assembly language.

  • Introduced in the 1950s
  • Designed around ease of use for the programmer
  • Driven by desire for reduction in bugs, increases in code reuse
  • Based on natural language
  • Often designed with structured programming in mind

Examples:

Most modern general-purpose languages, such as C, C++, C#, Java, Basic, COBOL, Lisp and ML.

 

Fourth-generation programming language (4GL) – Application Specific

"A programming language is low level when its programs require attention to the irrelevant."
-Alan J. Perlis

A fourth generation language is designed to make problems in a specific domain simple to implement.  This has the advantage of greatly reducing the cost in development time.  At the same time there is the disadvantage of increasing developer learning cost.

  • Introduced in the 1970s, term coined by James Martin
  • Driven by the need to enhance developer productivity
  • Further from the machine
  • Closer to the domain

Some examples: SQL, SAS, R, MATLAB's GUIDE, ColdFusion, CSS

 

Fifth-generation programming language (5GL) – Constraint Oriented

“There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. The first method is far more difficult.”
- Tony Hoare

It has been argued that there is no such thing as a 5GL language.  This seems to me ridiculous as working with domain specific syntax is hardly an abstractional dead end.  This cynicism is likely a result of the many false claims of 5GL for the sake of marketing.

Many researchers speak of 5GL languages as constraint systems.  The programmer inputs a set of logical constraints, with no specified algorithm, and the AI-based compiler builds the program based on these constraints.

  • Introduced in the 1990s
  • Constraint-based instead of algorithmic
  • Used for AI Research, Proof solving, Logical Inference
  • Not in common use

Some examples: Prolog, Mercury

 

Conclusion

An interesting history lesson, although I can’t help but feel that categories beyond 3GL are somewhat arbitrarily defined.  I do agree that 4GL is an abstraction on 3GL.  Perhaps, however, there are other directions which are equally abstract in relation to 3GL.  Perhaps after concrete logic-based systems, free-form natural language should have been fourth.  This could be followed by thought-based programming, which I feel would be the ultimate level of abstraction for human interaction.

Also, to my great disappointment, I was unable to find out who coined most of the “# generation language” terms.  In computer science it is usually possible to gain insight into a concept by examining the author’s other works, but in this case that option seems unavailable.

 
